SciPy's quad function can be used to numerically evaluate definite integrals, including improper integrals with infinite limits. However, some functions have a rather narrow range where most of their area is concentrated (likelihood functions, for example), and quad sometimes misses it. It reports that the integral is approximately 0 when it really just never sampled the region where the function isn't 0.
For example, the area under the curve of a log-normal distribution from 0 to inf should be 1. Here quad succeeds with a geometric mean (scale) of 1 but not 2:
from scipy.integrate import quad
from scipy.stats import lognorm
import numpy as np
quad(lambda x: lognorm.pdf(x, 0.01, scale=1), 0, np.inf)
# (1.0000000000000002, 1.6886909404731594e-09)
quad(lambda x: lognorm.pdf(x, 0.01, scale=2), 0, np.inf)
# (6.920637959567767e-14, 1.2523928482954713e-13)
I often know beforehand approximately where the bulk of the mass is. How do I tell quad to start there? If this isn't possible, I'll accept a different tool.
The points parameter of quad can be used to tell it where (approximately) it should look. It can't be combined with an infinite integration limit, so the range of integration has to be split into the finite interval up to the last point, plus an infinite tail.
points = (0.1, 1, 10, 100)
func = lambda x: lognorm.pdf(x, 0.01, scale=2) # works for other scales too
integral = quad(func, 0, points[-1], points=points)[0] + quad(func, points[-1], np.inf)[0]
A geometric sequence of points, like in this example, is good enough for a wide range of scales.
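For instance, a sketch (np.geomspace is just one convenient way to build such a sequence): the same split-integral pattern wrapped in a loop over several scales; each total should come out close to 1:
import numpy as np
from scipy.integrate import quad
from scipy.stats import lognorm
points = np.geomspace(0.1, 100, 4)  # geometric waypoints: 0.1, 1, 10, 100
for scale in (1, 2, 5):
    func = lambda x, s=scale: lognorm.pdf(x, 0.01, scale=s)
    total = (quad(func, 0, points[-1], points=points)[0]
             + quad(func, points[-1], np.inf)[0])
    print(scale, total)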
If the boundaries are 0, -inf, or inf, then the first guess made by quad is always 1. This can be exploited by shifting the integral so that the waypoint is at 1. For example, shifting the log normal distribution so that its mode is at 1 doesn't change the area, but guarantees that quad finds the bulk of the distribution:
from scipy.integrate import quad
from scipy.stats import lognorm
import numpy as np
mode = np.exp(np.log(2) - 0.01**2)
quad(lambda x: lognorm.pdf(x + mode - 1, 0.01, scale=2), -np.inf, np.inf)
# (0.9999999999999984, 2.2700129642154882e-09)
This only works if there is only one point of interest and the bounds are -inf to inf (otherwise, shifting the function shifts the bounds, which changes the first guess). If so, then this allows the integral to be computed with a single call to quad.
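If you want to reuse the trick, it can be wrapped in a small helper. This is only a sketch under the same assumptions: a single bulk region near a known waypoint, and an integrand that is defined (and negligible or zero) on the rest of the real line:
import numpy as np
from scipy.integrate import quad
from scipy.stats import lognorm

def quad_shifted(f, waypoint):
    # Shift the integrand so that `waypoint` lands at x = 1 (where quad makes
    # its first guess), then integrate over the whole real line.
    return quad(lambda x: f(x + waypoint - 1), -np.inf, np.inf)

# The log-normal pdf is zero for x <= 0, so extending the lower bound to -inf
# adds nothing; the mode (about 2 for scale=2, s=0.01) serves as the waypoint.
print(quad_shifted(lambda x: lognorm.pdf(x, 0.01, scale=2), 2.0))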
I can get a uniform grid on [0, 2*pi) with numpy's np.arange(); however, I want a grid with the same number of points but with a higher density of points on a certain interval, i.e. a finer grid on [pi, 1.5*pi] for example. How can I achieve this? Is there a numpy function that accepts a density function and whose output is a grid with that density?
I'm surprised that I can't find a similar Q&A on Stack Overflow. There are a few on doing something similar for random numbers from a discrete distribution, but not for continuous distributions and also not as a modified np.arange or np.linspace.
If you need an x range for plotting that has finer sampling in areas where you expect the function to fluctuate more rapidly, you can create a nonlinear function that takes inputs in the range 0 to 1 and produces outputs in the same range, but nonlinearly. For example:
import numpy as np
def f(x):
    return x**2
num = 40  # desired number of grid points
angles = 2*np.pi*f(np.linspace(0, 1, num, endpoint=False))
This will produce fine sampling near zero and coarse sampling near 2*pi.
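To see the effect, or to flip it if you instead want dense sampling near 2*pi, a quick sketch (reusing np and num from above):
u = np.linspace(0, 1, num, endpoint=False)
print(np.diff(2*np.pi*u**2))              # gaps grow: dense near 0, sparse near 2*pi
print(np.diff(2*np.pi*(1 - (1 - u)**2)))  # gaps shrink: sparse near 0, dense near 2*pi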
For more fine-grained control of the sampling density, you can use the function below. As a bonus, it will also allow random sampling.
import numpy as np

def density_space(xs, ps, n, endpoint=False, order=1, random=False):
    """Draw samples with spacing specified by a density function.

    Copyright Han-Kwang Nienhuys (2020).
    License: any of: CC-BY, CC-BY-SA, BSD, LGPL, GPL.
    Reference: https://stackoverflow.com/a/62740029/6228891

    Parameters:

    - xs: array, ordered by increasing values.
    - ps: array, corresponding densities (not normalized).
    - n: number of output values.
    - endpoint: whether to include x[-1] in the output.
    - order: interpolation order (1 or 2). Order 2 will
      require dense sampling and a smoothly varying density
      to work correctly.
    - random: whether to return random samples, ignoring endpoint.
      In this case, n can be a shape tuple.

    Return:

    - array, shape (n,), with values from xs[0] to xs[-1]
    """
    from scipy.interpolate import interp1d
    from scipy.integrate import cumtrapz

    cps = cumtrapz(ps, xs, initial=0)
    cps *= (1/cps[-1])
    intfunc = interp1d(cps, xs, kind=order)

    if random:
        return intfunc(np.random.uniform(size=n))
    else:
        return intfunc(np.linspace(0, 1, n, endpoint=endpoint))
Test:
values = density_space(
[0, 100, 101, 200],
[1, 1, 2, 2],
n=12, endpoint=True)
print(np.around(values))
[ 0. 27. 54. 82. 105. 118. 132. 146. 159. 173. 186. 200.]
The cumulative density function is created using trapezoid integration, which is essentially based on linear interpolation. A higher-order integration is not safe because the input may have (near-)discontinuities, like the jump from x=100 to x=101 in the example. A discontinuity in the input results in a discontinuous first derivative in the cumulative density function (cps in the code), which will cause problems with a smooth interpolation (order 2 or above). Hence the recommendation to use order=2 only for a smooth density function - and not to use any higher orders.
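Applied to the grid from the original question, here is a sketch (the knot positions, the transition width, and the 4:1 density ratio are arbitrary choices) of a grid on [0, 2*pi) with roughly four times the point density on [pi, 1.5*pi], reusing density_space from above:
eps = 0.01  # width of the transition regions between the two densities (assumed)
xs = [0, np.pi - eps, np.pi, 1.5*np.pi, 1.5*np.pi + eps, 2*np.pi]
ps = [1, 1, 4, 4, 1, 1]
grid = density_space(xs, ps, n=64, endpoint=False)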
I want to integrate a Gaussian function over a very large interval. I chose the scipy.integrate.quad function for the integration. The function seems to work only when I select a small enough interval. When I use the code below,
from scipy.integrate import quad
from math import pi, exp, sqrt
def func(x, mean, sigma):
    return 1/(sqrt(2*pi)*sigma) * exp(-1/2*((x-mean)/sigma)**2)
print(quad(func, 0, 1e+31, args=(1e+29, 1e+28))[0]) # case 1
print(quad(func, 0, 1e+32, args=(1e+29, 1e+28))[0]) # case 2
print(quad(func, 0, 1e+33, args=(1e+29, 1e+28))[0]) # case 3
print(quad(func, 1e+25, 1e+33, args=(1e+29, 1e+28))[0]) # case 4
then the following is printed:
1.0
1.0000000000000004
0.0
0.0
To obtain a reasonable result, I had to try different lower/upper bounds several times and empirically settled on [0, 1e+32]. This seems risky to me, because whenever the mean and sigma of the Gaussian change, I have to hunt for new bounds again.
Is there a clean way to integrate the function from 0 to 1e+50 without worrying about bounds? If not, how can I predict in advance which bounds will give a non-zero value?
In short, you can't.
On this long interval, the region where the gaussian is non-zero is tiny, and the adaptive procedure which works under the hood of integrate.quad fails to see it. And so would pretty much any adaptive routine, unless by chance.
Notice that your integrand is just a normal density with mean m and standard deviation s, and the CDF of a normal random variable cannot be expressed with elementary functions; it is written as Φ(x). So take Φ((b-m)/s) - Φ((a-m)/s). Also note that Φ(x) = 1/2*(1 + erf(x/sqrt(2))), so you need not call quad to actually perform an integration and may have better luck with erf from scipy.
import numpy as np
from scipy.special import erf
def prob(mu, sigma, a, b):
    phi = lambda x: 1/2*(1 + erf((x - mu)/(sigma*np.sqrt(2))))
    return phi(b) - phi(a)
This may give more accurate results (and it does, compared with quad above):
>>> print(prob(0, 1e+31, 0, 1e+50))
0.5
>>> print(prob(0, 1e+32, 1e+28, 1e+29))
0.000359047985937333
>>> print(prob(0, 1e+33, 1e+28, 1e+29))
3.5904805169684195e-05
>>> print(prob(1e+25, 1e+33, 1e+28, 1e+29))
3.590480516979522e-05
and avoid the severe floating-point error you are experiencing. However, the regions you integrate over can contain so little of the total area that you may still see 0.
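For example (a sketch), plugging the question's mean and sigma into prob together with the bounds that made quad collapse to 0 shows that essentially all of the mass lies inside those bounds:
print(prob(1e+29, 1e+28, 0, 1e+33))      # bounds from case 3, prints ~1.0
print(prob(1e+29, 1e+28, 1e+25, 1e+33))  # bounds from case 4, prints ~1.0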
I am trying to compute integrals more precisely by specifying the epsabs parameter for scipy.integrate.quad. Say we are integrating the function sin(x) / x^2 from 1e-16 to 1.0:
from scipy.integrate import quad
import numpy
integrand = lambda x: numpy.sin(x) / x ** 2
integral = quad(integrand, 1e-16, 1.0)
This gives us
(36.760078801255595, 0.01091187908038005)
To make the result more precise, we specify the absolute error tolerance with epsabs:
from scipy.integrate import quad
import numpy
integrand = lambda x: numpy.sin(x) / x ** 2
integral = quad(integrand, 1e-16, 1.0, epsabs = 1e-4)
The result is exactly the same and the error is still as large as 0.0109! Am I misunderstanding the epsabs parameter? What should I do differently to increase the precision of the integral?
According to the SciPy manual, the quad function has a limit argument to specify
An upper bound on the number of subintervals used in the adaptive algorithm.
By default the value of limit is 50.
Your code returns a warning message:
quadpack.py:364: IntegrationWarning: The maximum number of
subdivisions (50) has been achieved. If increasing the limit yields
no improvement it is advised to analyze the integrand in order to
determine the difficulties. If the position of a local difficulty
can be determined (singularity, discontinuity) one will probably
gain from splitting up the interval and calling the integrator on
the subranges. Perhaps a special-purpose integrator should be used.
warnings.warn(msg, IntegrationWarning)
You have to increase the limit argument, e.g.:
from scipy.integrate import quad
import numpy
integrand = lambda x: numpy.sin(x) / x ** 2
print(quad(integrand, 1e-16, 1.0, epsabs = 1e-4, limit=100))
Output:
(36.7600787611414, 3.635057215414274e-05)
There is no warning message in the output. The number of subdivisions stayed under 100 and quad reached the required accuracy.
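As the warning message itself suggests, another option is to split the range by hand at a point that isolates the nearly singular region around x = 0 and sum the pieces; a sketch (the split point 1e-8 is an arbitrary choice):
from scipy.integrate import quad
import numpy
integrand = lambda x: numpy.sin(x) / x ** 2
left = quad(integrand, 1e-16, 1e-8)   # steep, 1/x-like part
right = quad(integrand, 1e-8, 1.0)    # well-behaved part
print(left[0] + right[0], left[1] + right[1])  # value and a rough combined error bound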
There is a function which determines the intensity of the Fraunhofer diffraction pattern of a circular aperture... (more information)
The integral of the function over the distance x = [-3.8317, 3.8317] should be about 83.8% (if we assume that I0 is 100), and when you increase the distance to [-13.33, 13.33] it should be about 95%.
But when I compute the integral in Python, the answer is wrong. I don't know what's going wrong in my code :(
from scipy.integrate import quad
from scipy import special as sp
I0 = 100.0
dist = 3.8317
I = quad(lambda x: I0*(2*sp.j1(x)/x)**2, -dist, dist)[0]
print(I)
The result of the integral can't be bigger than 100 (I0), because this is the diffraction of I0... I don't know, maybe it's scaling... maybe it's the method! :(
The problem seems to be in the function's behaviour near zero. If the function is plotted, it looks smooth:
However, scipy.integrate.quad complains about round-off errors, which is very strange for such a smooth curve. The reason is that the function is not defined at 0 (of course, you are dividing by zero!), hence the integration does not go well.
You may use a simpler integration method or do something about your function. You may also integrate it from both sides starting very close to zero. Even so, with these numbers the integral does not look right when compared with your expected results.
However, I think I have a hunch of what your real problem is. As far as I remember, the integral you have shown is actually the intensity (power/area) of Fraunhofer diffraction as a function of distance from the center. If you want to integrate the total power within some radius, you have to do it in two dimensions.
By simple area integration rules you should multiply your function by 2 pi r before integrating (or x instead of r in your case). Then it becomes:
f = lambda r: r*(sp.j1(r)/r)**2
or
f = lambda r: sp.j1(r)**2/r
or even better:
f = lambda r: r*(sp.j0(r) + sp.jn(2, r))**2
The last form is best as it does not suffer from any singularities. It is based on Jaime's comment to the original answer (see the comment below this answer!).
(Note that I omitted a couple of constants.) Now you can integrate it from zero to infinity (no negative radii):
import numpy as np
fullpower = quad(f, 1e-9, np.inf)[0]
Then you can integrate from some other radius and normalize by the full intensity:
pwr = quad(f, 1e-9, 3.8317)[0] / fullpower
And you get 0.839 (which is quite close to 84 %). If you try the farther radius (13.33):
pwr = quad(f, 1e-9, 13.33)[0] / fullpower
which gives 0.954.
It should be noted that we introduce a small error by starting the integration from 1e-9 instead of 0. The magnitude of the error can be estimated by trying different values for the starting point. The integration result changes very little between 1e-9 and 1e-12, so they seem to be safe. Of course, you could use, e.g., 1e-30, but then there may be numerical instability in the division. (In this case there isn't, but in general singularities are numerically evil.)
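Such a check could look like this (a sketch, reusing f and fullpower from above); the printed fractions should barely change as the cutoff shrinks:
for eps in (1e-6, 1e-9, 1e-12):
    print(eps, quad(f, eps, 3.8317)[0] / fullpower)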
Let us do one more thing:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0.01, 20, 1000)
intg = np.array([quad(f, 1e-9, xx)[0] for xx in x])
plt.plot(x, intg/fullpower)
plt.grid(True)
plt.show()
And this is what we get:
At least this looks right, the dark fringes of the Airy disk are clearly visible.
As for the last part of the question: I0 defines the maximum intensity (the units may be, e.g., W/m2), whereas the integral gives the total power (if the intensity is in W/m2, the total power is in W). Setting the maximum intensity to 100 does not guarantee anything about the total power. That is why it is important to calculate the total power.
There actually exists a closed form equation for the total power radiated onto a circular area:
P(x) = P0 ( 1 - J0(x)^2 - J1(x)^2 ),
where P0 is the total power.
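A quick numerical check of that closed form (a sketch; encircled_fraction is just a throwaway name) reproduces both the quad-based results above and the percentages quoted in the question:
from scipy import special as sp
def encircled_fraction(x):
    return 1 - sp.j0(x)**2 - sp.j1(x)**2  # P(x)/P0 for the Airy pattern
print(encircled_fraction(3.8317))  # ~0.838, the ~83.8% from the question
print(encircled_fraction(13.33))   # ~0.95, the ~95% from the question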
Note that you can also get a closed-form solution for your integration using SymPy:
import sympy as sy
sy.init_printing()  # LaTeX-like pretty printing in IPython
x, d = sy.symbols("x, d", real=True)
I0 = 100
dist = 3.8317
f = I0*(2*sy.besselj(1, x)/x)**2  # the integrand
F = f.integrate((x, -d, d))       # symbolic integration
print(F.evalf(subs={d: dist}))    # numeric evaluation
F evaluates to:
1600*d*besselj(0, Abs(d))**2/3 + 1600*d*besselj(1, Abs(d))**2/3 - 800*besselj(1, Abs(d))**2/(3*d)
with besselj(0,r) corresponding to sp.j0(r).
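As a cross-check (a sketch), the same expression can be evaluated with plain SciPy Bessel functions; at d = 3.8317 it comes out around 331.5, matching the numerical result in the next answer:
from scipy import special as sp
d_ = 3.8317
val = (1600*d_*sp.j0(d_)**2/3
       + 1600*d_*sp.j1(d_)**2/3
       - 800*sp.j1(d_)**2/(3*d_))
print(val)  # ~331.5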
There might be a singularity in the integration algorithm when it evaluates the integrand at x = 0. You can exclude this point from the integration with points:
f = lambda x: I0*(2*sp.j1(x)/x)**2
I = quad(f, -dist, dist, points=[0])
I then get the following result (is this your desired result?):
331.4990321315221
I've discovered a strange behavior when using scipy.integrate.quad. This behavior also shows up in Octave's quad function, which leads me to believe that it may have something to do with QUADPACK itself. Interestingly enough, using the exact same Octave code, this behavior does not show up in MATLAB.
On to the question. I'm numerically integrating a function of a lognormal distribution over various bounds. With F as the integrand (built from the lognormal pdf, see below), a the lower bound, and b the upper bound, I find that under some conditions,
integral(F, a, b) = 0 when b is a "very large number," while
integral(F, a, b) = the correct limit when b is np.inf. (or just Inf for Octave.)
Here's some example code to show it in action:
from __future__ import division
import numpy as np
import scipy.stats as stats
from scipy.integrate import quad
# Set up the probability space:
sigma = 0.1
mu = -0.5*(sigma**2) # To get E[X] = 1
N = 7
z = stats.lognorm(sigma, 0, np.exp(mu))
# Set up F for integration:
F = lambda x: x*z.pdf(x)
# An example that appears to work correctly:
a, b = 1.0, 10
quad(F, a, b)
# (0.5199388..., 5.0097567e-11)
# But if we push it higher, we get a value which drops to 0:
quad(F, 1.0, 1000)
# (1.54400e-11, 3.0699e-11)
# HOWEVER, if we shove np.inf in there, we get correct answer again:
quad(F, 1.0, np.inf)
# (0.5199388..., 3.00668e-09)
# If we play around we can see where it "breaks:"
quad(F, 1.0, 500) # Ok
quad(F, 1.0, 831) # Ok
quad(F, 1.0, 832) # Here we suddenly hit close to zero.
quad(F, 1.0, np.inf) # Ok again
What is going on here? Why does quad(F, 1.0, 500) evaluate to approximately the correct thing, but quad(F, 1.0, b) goes to zero for all values 832 <= b < np.inf?
While I'm not exactly familiar with QUADPACK, adaptive integration generally works by increasing resolution until the answer no longer improves. Your function is so close to 0 for most of the interval (with F(10)==9.356e-116) that the improvement is negligible for the initial grid points that quad chooses, and it decides that the integral must be close to 0. Basically, if your data hides in a very narrow subinterval of the range of integration, quad eventually won't be able to find it.
For integration from 0 to inf, the interval obviously cannot be subdivided into a finite number of intervals, so quad will need some preprocessing before computing the integral. For example, a change of variables like y=1/(1+x) would map the interval 0..inf to 0..1. Subdividing that interval will sample more points near zero from the original function, enabling quad to find your data.
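A sketch of doing such a substitution by hand for the integrand from the question (with y = 1/(1+x) we have dx = dy/y**2 up to sign, and the spike near x ~ 1 lands near y ~ 0.5, right in the middle of the unit interval):
import numpy as np
import scipy.stats as stats
from scipy.integrate import quad
sigma = 0.1
z = stats.lognorm(sigma, 0, np.exp(-0.5*sigma**2))
F = lambda x: x*z.pdf(x)
G = lambda y: F(1.0/y - 1.0) / y**2  # transformed integrand on (0, 1)
print(quad(G, 0, 1))                 # ~1.0: the full integral of F over (0, inf)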
Try lowering the error tolerance:
>>> quad(F, a, 1000, epsabs=1.49e-11)
(0.5199388058383727, 2.6133800952484582e-11)
I guess numerical integration is just sensitive to the configuration. You can try to debug it by calling quad(..., full_output=1) and analyzing the verbose output carefully. Sorry if the answer is not satisfactory.
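For example (a sketch, reusing F from the question): with full_output=1 the returned info dict reports, among other things, how many function evaluations ('neval') and subintervals ('last') were used, which hints at whether quad ever looked near the bulk of the integrand:
res = quad(F, 1.0, 1000, full_output=1)
info = res[2]  # the verbose info dict
print(res[0], res[1], info['neval'], info['last'])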