Integrating a function using non-uniform measure (python/scipy)

I would like to integrate a function in python and supply the probability density (measure) used to sample values. If it's not obvious, integrating f(x)dx over [a,b] implicitly uses the uniform probability density over [a,b], and I would like to use my own probability density (e.g. exponential).
I can do it myself, using np.random.* but then
I miss the optimizations available in scipy.integrate.quad. Or maybe all those optimizations assume the uniform density?
I need to do the error estimation myself, which is not trivial. Or maybe it is? Maybe the error is just the variance of sum(f(x))/n?
Any ideas?

As unutbu said, if you have the density function, then you can just integrate the product of your function with the pdf using scipy.integrate.quad.
For the distributions that are available in scipy.stats, we can also just use the expect method.
For example
>>> import numpy as np
>>> from scipy import stats
>>> f = lambda x: x**2
>>> stats.norm.expect(f, loc=0, scale=1)
1.0000000000000011
>>> stats.norm.expect(f, loc=0, scale=np.sqrt(2))
1.9999999999999996
scipy.integrate.quad also has some predefined weight functions, although they are not normalized to be probability density functions.
The approximation error depends on the settings for the call to integrate.quad.
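For the first approach (integrating the product f(x)*pdf(x) with quad), here is a minimal sketch; the exponential density is just an example choice, not from the original answer:

import numpy as np
from scipy import integrate, stats

f = lambda x: x**2

# E[f(X)] for X ~ Exponential(scale=1): integrate f(x) * pdf(x) over its support
value, abserr = integrate.quad(lambda x: f(x) * stats.expon.pdf(x), 0, np.inf)
print(value, abserr)   # value should be very close to 2 (the exact answer)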

Just for the sake of brevity, four ways were suggested for calculating the expected value of f(x) under the probability density p(x):
Assuming p is given in closed form, use scipy.integrate.quad to integrate f(x)p(x).
Assuming p can be sampled from, draw N samples X from p, then estimate the expected value by np.mean(f(X)) and its error by np.std(f(X))/np.sqrt(N) (see the sketch after this list).
Assuming p is one of the distributions available in scipy.stats, use its expect method, e.g. stats.norm.expect(f).
Assuming we have the CDF of the distribution rather than p(x), compute H = Inverse[CDF] and then integrate f(H(y)) over (0, 1) using scipy.integrate.quad.
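A minimal sketch of the sampling approach (item 2 above), again using an exponential density as the example:

import numpy as np
from scipy import stats

f = lambda x: x**2

N = 100_000
X = stats.expon.rvs(size=N, random_state=0)   # draw samples from p (seeded for reproducibility)
estimate = np.mean(f(X))                      # Monte Carlo estimate of E[f(X)]
stderr = np.std(f(X)) / np.sqrt(N)            # standard error of the estimate
print(estimate, "+/-", stderr)                # should be close to the exact value 2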

Another possibility would be to integrate x -> f(H(x)), where H is the inverse of the cumulative distribution function of your probability distribution.
[This follows from a change of variable: substituting y = CDF(x) and noting that p(x) = CDF'(x) gives dy = p(x)dx, and thus int{f(x)p(x)dx} == int{f(x)dy} == int{f(H(y))dy}, with H the inverse of the CDF.]
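A minimal sketch of this change of variables, using the exponential distribution (whose inverse cdf is available as stats.expon.ppf):

from scipy import integrate, stats

f = lambda x: x**2

# E[f(X)] = integral of f(H(y)) dy over (0, 1), with H the inverse cdf (ppf).
# The integrand grows (slowly) as y -> 1; quad usually copes, but if it
# complains, shaving a tiny amount off the upper limit is a pragmatic fallback.
value, abserr = integrate.quad(lambda y: f(stats.expon.ppf(y)), 0, 1)
print(value, abserr)   # again close to the exact value 2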

Related

How to obtain a python scipy-type continuous rv distribution object that is bounded?

I would like to define a bounded version of a continuous random variable distribution (say, an exponential, but I might want to use others as well). The bounds are 0 and 1. I would like to
draw random variates (as done by scipy.stats.rv_continuous.rvs),
use the ppf (percentage point function) (as done by scipy.stats.rv_continuous.ppf), and possibly
use the cdf (cumulative distribution function) (as done by scipy.stats.rv_continuous.cdf)
Possible approaches I can think of:
Getting random variates in an ad hoc way is not difficult
import numpy as np
import scipy.stats

d = scipy.stats.expon(0, 3/10.)  # an exponential distribution as an example
rv = d.rvs(size=target_number_of_rv)  # target_number_of_rv is defined elsewhere
rv = rv[0 <= rv]
rv = rv[rv <= 1]
while len(rv) < target_number_of_rv:
    extra = d.rvs(size=target_number_of_rv - len(rv))
    extra = extra[0 <= extra]
    extra = extra[extra <= 1]
    rv = np.concatenate([rv, extra])
but 1) this is non-generic and potentially error-prone and 2) it does not help with the ppf or cdf.
Subclassing scipy.stats.rv_continuous, as is done here and here. Thereby, the ppf of scipy.stats.rv_continuous can be used. The drawback is that it requires the pdf of the bounded distribution (not just a pre-defined rv_continuous object, or the pdf of the unbounded distribution plus the bounds), and if this pdf is wrong, the cdf, the ppf and everything else will be wrong as well.
Designing a class that takes care of applying the bounds to the rv generation and of correcting the value of the ppf obtained from the unbounded object in scipy.stats. A drawback is that this is non-generic and error-prone as well, and that correcting the ppf may be difficult. My feeling is that the cdf of the unbounded distribution could be rescaled by the share of probability mass that lies within the bounds, but I may be wrong. That would be, for lower and upper bounds l and u and any valid quantile x (with l <= x <= u): (cdf(x) - cdf(l)) / (cdf(u) - cdf(l)). Obtaining the ppf would, however, require inverting the resulting function.
My feeling is that there might be a better and more generic way to do this. Is there? Maybe with sympy? Maybe by somehow obtaining the function object of the unbounded cdf and modifying it directly?
Python version: 3.6.2, scipy version: 0.19.1.
If the distribution is one of those that are available in scipy.stats, then you can evaluate its integral between the two bounds using the cdf of that distribution. Otherwise, you can define the pdf for rv_continuous and then use its cdf to get this integral.
Now you have, in effect, the pdf of the bounded distribution you want, because that integral is its normalising constant. You can proceed to use rv_continuous with the form you have for the pdf, the normalising constant, and the bounds.
Here's what your code might look like. The variable scale is set according to the scipy documentation. norm is the integral of the exponential pdf over [0, 1]. Only about 0.49 of the probability mass is accounted for; therefore, to make the exponential, when truncated to the [0, 1] interval, integrate to one, we must divide its pdf by this factor.
Truncated_expon is defined as a subclass of rv_continuous, as in the documentation. By supplying its pdf we make it possible (at least for such a simple integrand!) for scipy to calculate this distribution's cdf, and thereby to draw random samples.
I have calculated the cdf at one as a check.
>>> from scipy import stats
>>> lamda = 2/3
>>> scale = 1/lamda
>>> norm = stats.expon.cdf(1, scale=scale)
>>> norm
0.48658288096740798
>>> from math import exp
>>> class Truncated_expon(stats.rv_continuous):
...     def _pdf(self, x, lamda):
...         return lamda*exp(-lamda*x)/0.48658288096740798
...
>>> e = Truncated_expon(a=0, b=1, shapes='lamda')
>>> e.cdf(1, lamda=lamda)
1.0
>>> e.rvs(size=20, lamda=lamda)
array([ 0.20064067,  0.67646465,  0.89118679,  0.86093035,  0.14334989,
        0.10505598,  0.53488779,  0.11606106,  0.41296616,  0.33650899,
        0.95126415,  0.57481087,  0.04495104,  0.00308469,  0.23585195,
        0.00653972,  0.59400395,  0.34919065,  0.91762547,  0.40098409])
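The cdf-rescaling idea sketched in the question can also be made generic. A minimal sketch (my own wrapper class, not a scipy API) that truncates any frozen scipy.stats distribution by rescaling its cdf and inverting that rescaling for the ppf:

from scipy import stats

class Truncated:
    # Truncate a frozen scipy.stats distribution to [lower, upper] by
    # rescaling its cdf; the ppf follows by inverting that rescaling.
    def __init__(self, frozen, lower, upper):
        self.frozen = frozen
        self.cl = frozen.cdf(lower)
        self.cu = frozen.cdf(upper)

    def cdf(self, x):
        return (self.frozen.cdf(x) - self.cl) / (self.cu - self.cl)

    def ppf(self, q):
        # map q in [0, 1] back through the unbounded ppf
        return self.frozen.ppf(self.cl + q * (self.cu - self.cl))

    def rvs(self, size=None, random_state=None):
        u = stats.uniform.rvs(size=size, random_state=random_state)
        return self.ppf(u)

t = Truncated(stats.expon(0, 3/10.), 0, 1)
print(t.cdf(1.0), t.ppf(0.5))  # cdf at the upper bound is 1.0 by construction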

How to generate a Q-Q plot manually without inverse distribution function in python

I have 4 different distributions which I've fitted to a sample of observations. Now I want to compare my results and find the best solution. I know there are a lot of different methods to do that, but I'd like to use a quantile-quantile (q-q) plot.
The formulas for my 4 distributions are not reproduced here; they involve K0, the modified Bessel function of the second kind and zeroth order, and Γ, the gamma function.
My sample style looks roughly like this: (0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7 ...), so I have multiple identical values and also gaps in between them.
I've read the instructions on this site and tried to implement them in python. So, like in the link:
1) I sorted my data from the smallest to the largest value.
2) I computed "n" evenly spaced points on the interval (0,1), where "n" is my sample size.
3) And this is the point I can't manage.
As far as I understand, I should now use the values I calculated beforehand (those evenly spaced values), put them in the inverse functions of my above distributions and thus compute the theoretical quantiles of my distributions.
For reference, the inverse functions (partly calculated with WolframAlpha, as far as that was possible) are also not reproduced here; W denotes the Lambert W-function, and everything in brackets after it is its argument.
The problem is that apparently no inverse function exists for the first distribution. The next one would probably produce complex values (a negative value under the root, because b = 0.55 according to the fit), and the last two involve the Lambert W-function (and I'm unsure how to implement that in python).
So my question is, is there a way to calculate the q-q plots without the analytical expressions of the inverse distribution functions?
I'd appreciate any help you could give me very much!
A simpler and more conventional way to go about this is to compute the log likelihood for each model and choose the one with the greatest log likelihood. You don't need the cdf or quantile function for that, only the density function, which you already have.
The log likelihood is just the sum of log p(x|model), where p(x|model) is the probability density of datum x under a given model. Here "model" means the model with its parameters selected by maximizing the log likelihood over the possible parameter values.
You can be more careful about this by integrating the log likelihood over the parameter space, taking into account any prior probability assigned to each model; that would be a Bayesian approach.
It sounds like you are essentially looking to choose a model by minimizing the Kolmogorov-Smirnov (KS) statistic, which despite its weighty name is pretty simple: it is the maximum difference between the empirical cdf and the cdf of the fitted model. That's defensible, but I think comparing log likelihoods is more conventional, and also simpler, since you only need the pdf.
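A minimal sketch of the log-likelihood comparison, assuming each candidate's pdf is available as a Python function and its parameters have already been fitted (the candidate list below is a placeholder, not the distributions from the question):

import numpy as np

def log_likelihood(pdf, data, *params):
    # sum of log p(x | model); a tiny floor avoids log(0) for stray zero densities
    p = np.array([pdf(x, *params) for x in data])
    return np.sum(np.log(np.maximum(p, 1e-300)))

# hypothetical candidates: (name, pdf, fitted parameters)
candidates = [
    ("exponential", lambda x, lam: lam * np.exp(-lam * x), (1.5,)),
    ("half-normal", lambda x, s: np.sqrt(2 / np.pi) / s * np.exp(-x**2 / (2 * s**2)), (0.8,)),
]

data = np.array([0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7])
best = max(candidates, key=lambda c: log_likelihood(c[1], data, *c[2]))
print("best model by log likelihood:", best[0])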
It happens that there is an easier way. It's taken me a day or two to dig around until I was pointed toward the right method in scipy.stats. I was looking for the wrong sort of name!
First, build a subclass of rv_continuous to represent one of your distributions. We know the pdf for your distributions, so that's what we define. In this case there's just one parameter. If more are needed just add them to the def statement and use them in the return statement as required.
>>> from scipy import stats
>>> param = 3/2
>>> from math import exp
>>> class NoName(stats.rv_continuous):
...     def _pdf(self, x, param):
...         return param*exp(-param*x)
...
Now create an instance of this object, declare the lower end of its support (ie, the lowest value that the r.v. can assume), and what the parameters are called.
>>> noname = NoName(a=0, shapes='param')
I don't have an actual sample of values to play with. I'll create a pseudo-random sample.
>>> sample = noname.rvs(size=100, param=param)
Sort it to obtain the empirical quantiles (the points of the so-called 'empirical cdf').
>>> empirical_cdf = sorted(sample)
The sample has 100 elements, so generate 100 points at which to evaluate the inverse cdf, or quantile function, as discussed in the paper you referenced.
>>> theoretical_points = [(_-0.5)/len(sample) for _ in range(1, 1+len(sample))]
Get the quantile function values at these points.
>>> theoretical_cdf = [noname.ppf(_, param=param) for _ in theoretical_points]
Plot it all.
>>> from matplotlib import pyplot as plt
>>> plt.plot([0,3.5], [0, 3.5], 'b-')
[<matplotlib.lines.Line2D object at 0x000000000921B400>]
>>> plt.scatter(empirical_cdf, theoretical_cdf)
<matplotlib.collections.PathCollection object at 0x000000000921BD30>
>>> plt.show()
Here's the Q-Q plot that results.
Darn it ... Sorry, I was fixated on a slick solution that would somehow bypass the missing inverse CDF and calculate the quantiles directly (and avoid any numerical approaches). But it can also be done by simple brute force.
First, define a fine grid of quantiles for your distributions yourself (for instance ten times finer than the original/empirical quantiles). Then calculate the corresponding CDF values. Then compare these values one by one with the ones calculated in step 2 of the question. The grid quantiles whose CDF values show the smallest deviations are the ones you were looking for.
The precision of this solution is limited by the resolution of the quantile grid you define.
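A minimal sketch of this brute-force inversion, assuming the cdf is available as a callable (an exponential cdf stands in here for the real distributions from the question):

import numpy as np
from scipy import stats

cdf = lambda x: stats.expon.cdf(x, scale=1 / 1.5)   # stand-in for a cdf with no closed-form inverse

# a fine grid of candidate quantiles and their cdf values
grid = np.linspace(0, 10, 10_001)
cdf_on_grid = cdf(grid)

# the evenly spaced probabilities from step 2 of the question
n = 100
probs = (np.arange(1, n + 1) - 0.5) / n

# for each probability, pick the grid point whose cdf value deviates least
idx = np.argmin(np.abs(cdf_on_grid[None, :] - probs[:, None]), axis=1)
theoretical_quantiles = grid[idx]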
But maybe I'm wrong and there is a more elegant way to solve this problem; if so, I would be happy to hear it!

Calculate moments (mean, variance) of distribution in python

I have two arrays. x is the independent variable, and counts is the number of counts of x occurring, like a histogram. I know I can calculate the mean by defining a function:
import numpy as np

def mean(x, counts):
    return np.sum(x * counts) / np.sum(counts)
Is there a general function I can use to calculate each moment from the distribution defined by x and counts? I would also like to compute the variance.
You could use the moment function from scipy (scipy.stats.moment). It calculates the n-th central moment of a sample; note that it treats its argument as raw data points, not as (x, counts) pairs.
You could also define your own function, which could look something like this:
def nmoment(x, counts, c, n):
    return np.sum(counts * (x - c)**n) / np.sum(counts)
In that function, c is meant to be the point around which the moment is taken, and n is the order. So to get the variance you could do nmoment(x, counts, np.average(x, weights=counts), 2).
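For example, with some made-up histogram data (not from the question), a quick check that the weighted mean and variance come out consistently:

import numpy as np

def nmoment(x, counts, c, n):   # as defined above
    return np.sum(counts * (x - c)**n) / np.sum(counts)

x = np.array([0.0, 1.0, 2.0, 3.0])
counts = np.array([2, 5, 8, 3])

m = nmoment(x, counts, 0, 1)          # first raw moment = weighted mean
var = nmoment(x, counts, m, 2)        # second central moment = variance
print(m, var)
print(np.average(x, weights=counts))  # should match m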
import numpy as np
from scipy import stats

# stats.moment expects raw samples, so expand the histogram first
# (this requires integer counts)
sample = np.repeat(x, counts)
stats.moment(sample, moment=2)  # second central moment, i.e. the variance
stats.moment returns the nth central moment of a sample.
NumPy's statistics routines also cover this directly:
https://numpy.org/doc/stable/reference/routines.statistics.html
np.average (which accepts weights, so np.average(x, weights=counts) gives the mean)
np.std
np.var
etc.

Log Normal Random Variables with Scipy

I fail to understand the very basics of creating lognormal variables as documented here.
The log normal distribution takes on mean and variance as parameters. I would like to create a frozen distribution using these parameters and then get cdf, pdf etc.
However, in the documentation, they get the frozen distribution using
from scipy.stats import lognorm
s = 0.953682269606
rv = lognorm(s)
's' seems to be the standard deviation. I tried to use the 'loc' and 'scale' parameters instead of 's', but that generated an error (s is a required parameter). How can I generate a frozen distribution with parameter values 'm', 's' for location and scale?
The mystery is solved (edit 3)
μ corresponds to ln(scale) (!)
σ corresponds to shape (s)
loc is not needed for setting any of σ and μ
I think it is a severe problem that this is not clearly documented. I guess many have fallen for this when doing simple tests with the lognormal distribution in SciPy.
Why is that?
The stats module treats loc and scale the same way for all distributions (this is not explicitly written down, but can be inferred by reading between the lines). My suspicion was that loc is subtracted from x, and the result is divided by scale (and the result is treated as the new x). I tested for that, and this turned out to be the case.
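The test can be as simple as checking the generic location-scale relation pdf(x, loc, scale) == pdf((x - loc)/scale) / scale, which is how scipy applies loc and scale to every continuous distribution (a quick check of my own, not from the original post):

import numpy as np
from scipy.stats import lognorm

s, loc, scale = 0.9, 0.5, 2.0
x = 3.0

lhs = lognorm.pdf(x, s, loc=loc, scale=scale)
rhs = lognorm.pdf((x - loc) / scale, s) / scale
print(np.isclose(lhs, rhs))  # True: loc shifts x, scale rescales it (and the density)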
What does it mean for the lognormal distribution? In the canonical definition of the lognormal distribution the term ln(x) appears. Obviously, the same term appears in SciPy's implementation. With above's considerations, this is how loc and scale end up in the logarithm:
ln((x-loc)/scale)
By the usual rules for logarithms, this is the same as
ln(x-loc) - ln(scale)
In the canonical definition of the lognormal distribution the term simply is ln(x) - μ. Comparing SciPy's approach and the canonical approach then provides the crucial insight: ln(scale) represents μ. loc, however, has no correspondence in the canonical definition and is better left at 0. Further below, I have argued for the fact that shape (s) is σ.
Proof
>>> import math
>>> from scipy.stats import lognorm
>>> mu = 2
>>> sigma = 2
>>> l = lognorm(s=sigma, loc=0, scale=math.exp(mu))
>>> print("mean: %.5f stddev: %.5f" % (l.mean(), l.std()))
mean: 54.59815 stddev: 399.71719
I use WolframAlpha as a reference. It provides analytically determined values for the mean and standard deviation of the lognormal distribution.
http://www.wolframalpha.com/input/?i=log-normal+distribution%2C+mean%3D2%2C+sd%3D2
The values match.
WolframAlpha as well as SciPy come up with the mean and standard deviation by evaluating analytical terms. Let's perform an empirical test, by taking many samples from the SciPy distribution, and calculate their mean and standard deviation "manually" (from the whole set of samples):
>>> import numpy as np
>>> samples = l.rvs(size=2*10**7)
>>> print("mean: %.5f stddev: %.5f" % (np.mean(samples), np.std(samples)))
mean: 54.52148 stddev: 380.14457
This is still not perfectly converged, but I think it is proof enough that the samples correspond to the same distribution that WolframAlpha assumed, given μ=2 and σ=2.
And another small edit: it looks like proper usage of a search engine would have helped, we were not the first to be trapped by this:
https://stats.stackexchange.com/questions/33036/fitting-log-normal-distribution-in-r-vs-scipy
http://nbviewer.ipython.org/url/xweb.geos.ed.ac.uk/~jsteven5/blog/lognormal_distributions.ipynb
scipy, lognormal distribution - parameters
Another edit: now that I know how it behaves, I realize that the behavior is, in principle, documented. In the "notes" section we can read:
with shape parameter sigma and scale parameter exp(mu)
It is just really not obvious (neither of us was able to appreciate the importance of this small sentence). I guess the reason we could not understand what this sentence means is that the analytical expression shown in the notes section does not include loc and scale. This is probably worth a bug report / documentation improvement.
Original answer:
Indeed, the shape parameter topic is not well-documented when looking into the docs page for a particular distribution. I recommend having a look at the main stats documentation -- there is a section on shape parameters:
http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#shape-parameters
It looks like there should be a lognorm.shapes property, telling you about what the s parameter means, specifically.
Edit:
There is only one parameter, indeed:
>>> lognorm.shapes
's'
When comparing the general definition of the lognormal distribution (from Wikipedia):
and the formula given by the scipy docs:
lognorm.pdf(x, s) = 1 / (s*x*sqrt(2*pi)) * exp(-1/2*(log(x)/s)**2)
it becomes obvious that s is the true σ (sigma).
However, from the docs it is not obvious how the loc parameter is related to μ (mu).
It could be as in ln(x-loc), which would not correspond to μ in the general formula, or it could be ln(x)-loc, which would ensure correspondence between loc and μ. Try it out! :)
Edit 2
I have made comparisons between what WolframAlpha (WA) and SciPy say. WA is pretty clear that it uses μ and σ as generally understood (as defined in the linked Wikipedia article).
>>> l = lognorm(s=2, loc=0)
>>> print("mean: %.5f stddev: %.5f" % (l.mean(), l.std()))
mean: 7.38906 stddev: 54.09584
This matches WA's output.
Now, for loc not being zero, there is a mismatch. Example:
>>> l = lognorm(s=2, loc=1)
>>> print("mean: %.5f stddev: %.5f" % (l.mean(), l.std()))
mean: 8.38906 stddev: 54.09584
WA gives a mean of 20.08 and a standard deviation of 147. There you have it, loc does not correspond to μ in the classical definition of the lognormal distribution.
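A related practical point (also touched on in the stats.stackexchange link above): when recovering μ and σ from data with lognorm.fit, fixing loc at zero keeps the parameters interpretable in the classical sense. A hedged sketch:

import numpy as np
from scipy.stats import lognorm

mu, sigma = 2.0, 2.0
rng = np.random.default_rng(0)
samples = np.exp(rng.normal(mu, sigma, size=10**5))   # lognormal data with known mu, sigma

shape, loc, scale = lognorm.fit(samples, floc=0)      # fix loc = 0, so scale = exp(mu)
print(shape, np.log(scale))                           # approximately sigma and mu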

Standard error in non-linear regression

I have been doing some Monte Carlo physics simulations with Python and I am unable to determine the standard errors for the coefficients of a non-linear least squares fit.
Initially, I was using SciPy's scipy.stats.linregress for my model since I thought it would be a linear model, but noticed it is actually some sort of power function. I then used NumPy's polyfit with degree 2, but I can't find any way to determine the standard errors of the coefficients.
I know gnuplot can determine the errors for me, but I need to do fits for over 30 different cases. I was wondering if anyone knows of a way for Python to read the standard errors from gnuplot, or is there some other library I can use?
Finally found the answer to this long-asked question! I'm hoping this can at least save someone a few hours of hopeless research on this topic. Scipy has a function called curve_fit in its optimize module. It uses the least squares method to determine the coefficients and, best of all, it gives you the covariance matrix. The diagonal of that matrix holds the variance of each coefficient, and by taking square roots you get the standard error of each coefficient. Scipy doesn't have much documentation for this, so here's some sample code for a better understanding:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plot

def func(x, a, b, c):
    return a*x**2 + b*x + c  # Refer [1]

x = np.linspace(0, 4, 50)
y = func(x, 2.6, 2, 3) + 4*np.random.normal(size=len(x))  # Refer [2]

coeff, var_matrix = curve_fit(func, x, y)
variance = np.diagonal(var_matrix)  # Refer [3]
SE = np.sqrt(variance)  # Refer [4]

# ====== Making a dictionary to print results ========
results = {'a': [coeff[0], SE[0]], 'b': [coeff[1], SE[1]], 'c': [coeff[2], SE[2]]}
print("Coeff\tValue\t\tError")
for v, c in results.items():
    print(v, "\t", c[0], "\t", c[1])
# ======== End Results Printing =================

y2 = func(x, coeff[0], coeff[1], coeff[2])  # y values for the fitted model
plot.plot(x, y)
plot.plot(x, y2)
plot.show()
[1] What this function returns is critical, because it defines the model that will be used in the fit.
[2] Using the function to create some arbitrary data plus some noise.
[3] Saves the covariance matrix's diagonal to a 1D array.
[4] Square rooting the variance to get the standard error (SE).
It looks like gnuplot uses Levenberg-Marquardt, and there's a Python implementation (mpfit) available; you can get the error estimates from the mpfit.covar attribute. (Incidentally, you should think about what the error estimates "mean": are other parameters allowed to adjust to compensate, for example?)
