Scipy: speed up integration when doing it for the whole surface? - python

I have a probability density function (pdf) f(x,y). To get its cumulative distribution function (cdf) F(x,y) at the point (x,y), you integrate f over everything below and to the left of that point:
F(x,y) = Integral[v from -inf to y] Integral[u from -inf to x] f(u,v) du dv
In Scipy, I can do it by integrate.nquad:
from numpy import inf
from scipy import integrate

x, y = 5, 4
F_at_x_y = integrate.nquad(f, [[-inf, x], [-inf, y]])[0]  # nquad returns (value, error estimate)
Now I need the whole F(x,y) over the x-y plane, i.e. the cdf evaluated at every point of a grid.
How can I do that?
The main problem is that for every grid point from (-30,-30) to (30,30) I would have to run integrate.nquad from scratch to get F(x,y), which is far too slow.
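(For concreteness, the brute-force version I want to avoid would look something like the sketch below, with f being the pdf above and the grid spacing chosen arbitrarily.)

import numpy as np
from scipy import integrate

xs = np.arange(-30, 31)
ys = np.arange(-30, 31)
F = np.empty((len(xs), len(ys)))
for i, x in enumerate(xs):
    for j, y in enumerate(ys):
        # one full nquad call per grid point -- very slow
        F[i, j] = integrate.nquad(f, [[-np.inf, x], [-np.inf, y]])[0]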
I'm wondering, since the results build on one another (for example, you can get F(5,6) from the value of F(4,4) plus the integral over the region between those two points), whether it is possible to speed up the process so that we don't have to integrate from scratch at every point.
Possible useful links:
Multivariate Normal CDF in Python using scipy
http://cn.mathworks.com/help/stats/mvncdf.html
And I'm thinking about borrowing an idea from the way the Fibonacci sequence reuses previously computed values:
How to write the Fibonacci Sequence in Python

In the end, this is what I've done. With F the cdf and f the pdf, both sampled on the grid, inclusion-exclusion gives the recurrence
F(5,5) = F(5,4) + F(4,5) - F(4,4) + f(5,5)
and by looping through the whole surface you get the result.
The code would look like this:
import numpy as np

def cdf_from_pdf(pdf):
    # accept a flattened square grid as well as a 2-D array
    if not isinstance(pdf[0], np.ndarray):
        original_dim = int(np.sqrt(len(pdf)))
        pdf = pdf.reshape(original_dim, original_dim)
    cdf = np.copy(pdf)
    xdim, ydim = cdf.shape
    # cumulative sums along the two margins
    for i in range(1, xdim):
        cdf[i, 0] = cdf[i - 1, 0] + cdf[i, 0]
    for i in range(1, ydim):
        cdf[0, i] = cdf[0, i - 1] + cdf[0, i]
    # inclusion-exclusion recurrence for the interior
    for j in range(1, ydim):
        for i in range(1, xdim):
            cdf[i, j] = cdf[i - 1, j] + cdf[i, j - 1] - cdf[i - 1, j - 1] + pdf[i, j]
    return cdf
This is a very rough approximation; you can refine the result by replacing the +/- arithmetic with actual integration over each grid cell.
As for the initial value and the margins, cdf[0,:] and cdf[:,0], you can use integration there as well. In my case they are very small, so I just use the pdf values.
You can test the function by plotting the cdf, or by checking the value at cdf[n,n].
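For what it's worth, the same summed-area recurrence can be expressed as two cumulative sums, which moves the loops into NumPy; this is a minimal sketch under the same assumptions as above (pdf already sampled on the grid), with the optional cell size dx*dy being my addition:

import numpy as np

def cdf_from_pdf_vectorized(pdf, dx=1.0, dy=1.0):
    # cumulative sums along both axes reproduce
    # cdf[i,j] = cdf[i-1,j] + cdf[i,j-1] - cdf[i-1,j-1] + pdf[i,j]
    return pdf.cumsum(axis=0).cumsum(axis=1) * dx * dy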

Related

Demonstrating the Universality of the Uniform using numpy - an issue with transformation

Recently I wanted to demonstrate generating a continuous random variable using the universality of the Uniform. For that, I wanted to use the combination of numpy and matplotlib. However, the generated random variable seems a little bit off to me, and I don't know whether it is caused by the way NumPy's random uniform and vectorize work or whether I am doing something fundamentally wrong here.
Let U ~ Unif(0, 1) and X = F^-1(U). Then X is a random variable with CDF F (please note that F^-1 here denotes the quantile function; I also omit the second part of the universality because it will not be necessary).
Let's assume that the CDF of interest to me is the logistic CDF:
F(x) = 1 / (1 + e^-x)
then:
F^-1(u) = ln(u / (1 - u))
According to the universality of the uniform, to generate such a random variable it is enough to plug U ~ Unif(0, 1) into F^-1. Therefore, I've written a very simple code snippet for that:
import numpy as np

U = np.random.uniform(0, 1, 1000000)

def logistic(u):
    x = np.log(u / (1 - u))
    return x

logistic_transform = np.vectorize(logistic)
X = logistic_transform(U)
However, the result seems a little bit off to me: although the histogram of the generated random variable X resembles a logistic distribution (whose simplified CDF I've used), the r.v. seems to be distributed in a very uneven way, and I can't wrap my head around exactly why that is. I would be grateful for any suggestions. Below are the histograms of U and X.
You have a large sample size, so you can increase the number of bins in your histogram and still get a good number of samples per bin. If you are using matplotlib's hist function, try (for example) bins=400. I get this plot, which has the symmetry that I think you expected:
Also--and this is not relevant to the question--your function logistic will handle a NumPy array without wrapping it with vectorize, so you can save a few CPU cycles by writing X = logistic(U). And you can save a few lines of code by using scipy.special.logit instead of implementing it yourself.
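A minimal sketch of both suggestions combined (assuming matplotlib is used for the histogram; bins=400 and scipy.special.logit are the only changes):

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import logit

U = np.random.uniform(0, 1, 1000000)
X = logit(U)  # same as log(u / (1 - u)), already vectorized

plt.hist(X, bins=400)  # more bins reveal the expected symmetric shape
plt.show()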

Seeming inconsistency in scipy.integrate.quad operation under change of variable

I have an algorithm that needs to multiply two functions and integrate the result. One is a narrow normal distribution centered around 1.0, with a small standard deviation of about .1; the other is also a normal, but much larger, with mean 100,000 and stdev 15,000.
The code looks like so:
from scipy.integrate import quad
from scipy.stats import norm

def normal(x, mu, sigma):
    # assumed helper: the normal pdf (the question does not show its definition)
    return norm.pdf(x, mu, sigma)

def integrand(x, z, mu, sigma):
    # goal is to return f_x(x) * f_y(z/x) * 1/abs(x)
    # return normal(x, mu, sigma) * normal(z/x, 100000, 15000) / abs(x)
    return normal(z/x, mu, sigma) * normal(x, 100000, 15000) / abs(x)

# _z, MU and SIGMA are defined elsewhere in the program
pResult = quad(integrand, -10000000, 10000000,
               args=(_z, MU, SIGMA),
               points=[1.0, 2.0, 100000.0],
               epsabs=1, epsrel=.01)
This is a formula for multiplying random variables, Z = XY. It would seem that I could swap where the z/x goes in the integrand, but in trying to see which way runs faster (by commenting out one of the two return statements in integrand()) I was surprised to get different results (one is 114,221.4 and the other is 116,174.2). Any idea why?
Also, when I cycle through a wide range of z values and plot P(z) vs z, the graphs look different. One looks right (like a normal distribution, using the first return statement) and the other looks like a top-hat with soft corners.
The math for this algorithm: https://en.wikipedia.org/wiki/Product_distribution
Scaling matters, as I noted in the answer to your previous question, where I recommended using points to help quad locate relatively small features of the integrand.
You rescale the function but keep the same interval of integration. Assuming the function essentially "fits" in the interval (the tails outside of the interval are negligible), the difference is due to the algorithm being able or unable to locate the small feature of the function (the narrow Gaussian) within the interval of integration. The narrower the function's support, the more likely quad is to get the integral wrong.
I also recommend putting accuracy as the top priority; speed comes after that. It's easy to write a function that returns "0" quickly.
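To illustrate the point about narrow features (a sketch of my own, not part of the original answer): a Gaussian whose width is tiny compared to the integration interval can be missed entirely unless quad is told where it lives via points.

from scipy.integrate import quad
from scipy.stats import norm

def narrow_gaussian(x):
    # a narrow Gaussian (sigma = 0.1) centered at 1.0
    return norm.pdf(x, loc=1.0, scale=0.1)

no_hint, _ = quad(narrow_gaussian, -1e7, 1e7)                  # quad may never sample near x = 1
with_hint, _ = quad(narrow_gaussian, -1e7, 1e7, points=[1.0])  # hint: there is a feature at x = 1
print(no_hint, with_hint)                                      # the hinted result should be close to 1.0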

Reconstructing curve from gradient

Suppose I have a curve, and then I estimate its gradient via finite differences by using np.gradient. Given an initial point x[0] and the gradient vector, how can I reconstruct the original curve? Mathematically I see it's possible given this system of equations, but I'm not certain how to do it programmatically.
Here is a simple example of my problem, where I have sin(x) and I compute the numerical difference, which matches cos(x).
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(1, 30, 100)
test = np.vectorize(np.sin)(x)
numerical_grad = np.gradient(test, 30./100)
analytical_grad = np.vectorize(np.cos)(x)

## Plot data.
fig, ax = plt.subplots()
ax.plot(test, label='data', marker='o')
ax.plot(numerical_grad, label='gradient')
ax.plot(analytical_grad, label='proof', alpha=0.5)
ax.legend();
I found how to do it, using numpy's trapz function (trapezoidal rule integration).
Following up on the code I presented in the question, to reproduce the input array test, we do:
x = np.linspace(1, 30, 100)
integral = list()
for t in range(len(x)):
    integral.append(test[0] + np.trapz(numerical_grad[:t+1], x[:t+1]))
The integral array then contains the results of the numerical integration.
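An equivalent, loop-free sketch (assuming SciPy >= 1.6, where the cumulative trapezoidal routine is called cumulative_trapezoid; older versions expose it as cumtrapz):

from scipy.integrate import cumulative_trapezoid

# cumulative trapezoidal integration of the gradient, shifted by the known
# initial value test[0]; initial=0 keeps the output the same length as x
integral = test[0] + cumulative_trapezoid(numerical_grad, x, initial=0)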
You can restore the initial curve using integration.
As a real-life example: if you have the position function of 1D motion, you can get the velocity function as its derivative (gradient)
v(t) = s(t)' = ds / dt
And having the velocity, you can get the position back (not all functions are integrable analytically; in that case numerical integration is used) up to some unknown constant (shift), and with the initial position you can restore the exact value
s(T) = Integral[from 0 to T](v(t)dt) + s(0)

frequency analysis with unevenly spaced data in python

I have a signal generated by a simulation program. Because the solver in this program has a variable time step, I have a signal with unevenly spaced data. I have two lists: a list with the signal values, and another list with the times at which each value occurred. The data could be something like this:
import numpy as np

npts = 500
t = np.logspace(0, 1, npts)
f1 = 0.5
f2 = 0.6
sig = (1 + np.sin(2*np.pi*f1*t)) + (1 + np.sin(2*np.pi*f2*t))
I would like to be able to perform a frequency analysis on this signal using python. It seems I cannot use the fft function in numpy, because this requires evenly spaced data. Are there any standard functions which could help me find the frequencies contained in this signal?
The most common algorithm for this kind of problem is least-squares spectral analysis (the Lomb-Scargle periodogram). It looks like this will be in a future release of the scipy.signal package; maybe there is a current version, but I can't seem to find it. In addition, there is some code available from Astropython, which I will not copy in its entirety, but it essentially provides a lomb module that you can use as follows:
import numpy
import lomb
x = numpy.arange(10)
y = numpy.sin(x)
fx,fy, nout, jmax, prob = lomb.fasper(x,y, 6., 6.)
Very simple: just look up the formula for a Fourier transform and implement it as a discrete sum over your data values:
given a set of values f(x) over some set of x, then for each frequency k,
F(k) = sum_x ( f(x) * exp( -i * k * x ) )
Choose your k's ranging from 0 to 2*pi / (minimum separation in x),
and you can use 2*pi / max(x) as the increment size.
For a test case, use something for which you know the correct answer, e.g. a single cos( k' * x ) for some k', or a Gaussian.
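A minimal sketch of that discrete sum for unevenly spaced samples (the function name is mine; the frequency grid follows the suggestion above, and t, sig are taken from the question's example):

import numpy as np

def dft_uneven(t, sig):
    # frequencies from 0 up to 2*pi / (minimum separation in t),
    # in steps of 2*pi / max(t)
    k = np.arange(0, 2*np.pi / np.min(np.diff(t)), 2*np.pi / np.max(t))
    # F(k) = sum_x f(x) * exp(-i*k*x), evaluated for all k at once
    return k, np.exp(-1j * np.outer(k, t)) @ sig

k, F = dft_uneven(t, sig)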
An easy way out is to interpolate the signal onto evenly spaced time points, after which the ordinary FFT applies; a sketch follows.
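(A sketch of that idea, assuming np.interp is acceptable for the interpolation and reusing t, sig from the question.)

import numpy as np

# resample onto a uniform time grid with the same number of points
t_uniform = np.linspace(t.min(), t.max(), len(t))
sig_uniform = np.interp(t_uniform, t, sig)

# now an ordinary FFT applies
dt = t_uniform[1] - t_uniform[0]
spectrum = np.fft.rfft(sig_uniform)
freqs = np.fft.rfftfreq(len(sig_uniform), d=dt)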

resampling, interpolating matrix

I'm trying to interpolate some data for the purpose of plotting. For instance, given N data points, I'd like to be able to generate a "smooth" plot, made up of 10*N or so interpolated data points.
My approach is to generate an N-by-10*N matrix and compute the inner product of the original vector and the matrix I generated, yielding a 1-by-10*N vector. I've already worked out the math I'd like to use for the interpolation, but my code is pretty slow. I'm pretty new to Python, so I'm hopeful that some of the experts here can give me some ideas of ways I can try to speed up my code.
I think part of the problem is that generating the matrix requires 10*N^2 calls to the following function:
def sinc(x):
    import math
    try:
        return math.sin(math.pi * x) / (math.pi * x)
    except ZeroDivisionError:
        return 1.0
(This comes from sampling theory. Essentially, I'm attempting to recreate a signal from its samples, and upsample it to a higher frequency.)
The matrix is generated by the following:
def resampleMatrix(Tso, Tsf, o, f):
    from numpy import array as npar
    retval = []
    for i in range(f):
        retval.append([sinc((Tsf*i - Tso*j)/Tso) for j in range(o)])
    return npar(retval)
I'm considering breaking up the task into smaller pieces because I don't like the idea of an N^2 matrix sitting in memory. I could probably make 'resampleMatrix' into a generator function and do the inner product row-by-row, but I don't think that will speed up my code much until I start paging stuff in and out of memory.
Thanks in advance for your suggestions!
This is upsampling. See Help with resampling/upsampling for some example solutions.
A fast way to do this (for offline data, like your plotting application) is to use FFTs. This is what SciPy's native resample() function does. It assumes a periodic signal, though, so it's not exactly the same. See this reference:
Here’s the second issue regarding time-domain real signal interpolation, and it’s a big deal indeed. This exact interpolation algorithm provides correct results only if the original x(n) sequence is periodic within its full time interval.
Your function assumes the signal's samples are all 0 outside of the defined range, so the two methods will diverge away from the center point. If you pad the signal with lots of zeros first, it will produce a very close result. There are several more zeros past the edge of the plot that are not shown here.
Cubic interpolation won't be correct for resampling purposes. This example is an extreme case (near the sampling frequency), but as you can see, cubic interpolation isn't even close. For lower frequencies it should be pretty accurate.
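A sketch of the FFT-based route with zero padding (the helper name and padding amount are mine; the library calls are numpy and scipy.signal.resample):

import numpy as np
from scipy.signal import resample

def upsample_fft(samples, factor=10, pad=100):
    # pad with zeros so the FFT's periodicity assumption does not
    # wrap the ends of the signal into each other
    padded = np.concatenate([np.zeros(pad), samples, np.zeros(pad)])
    up = resample(padded, factor * len(padded))
    # drop the padded region again
    return up[factor * pad : factor * (pad + len(samples))]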
If you want to interpolate data in a quite general and fast way, splines or polynomials are very useful. SciPy has the scipy.interpolate module for exactly this; you can find many examples in the official documentation.
Your question isn't entirely clear; you're trying to optimize the code you posted, right?
Re-writing sinc like this should speed it up considerably. This implementation avoids checking that the math module is imported on every call, doesn't do attribute access three times, and replaces exception handling with a conditional expression:
from math import sin, pi

def sinc(x):
    return (sin(pi * x) / (pi * x)) if x != 0 else 1.0
You could also try avoiding creating the matrix twice (and holding it twice in parallel in memory) by creating a numpy.array directly (not from a list of lists):
def resampleMatrix(Tso, Tsf, o, f):
    retval = numpy.zeros((f, o))
    for i in xrange(f):
        for j in xrange(o):
            retval[i][j] = sinc((Tsf*i - Tso*j)/Tso)
    return retval
(replace xrange with range on Python 3.0 and above)
Finally, you can create rows with numpy.arange as well as calling numpy.sinc on each row or even on the entire matrix:
def resampleMatrix(Tso, Tsf, o, f):
    retval = numpy.zeros((f, o))
    for i in xrange(f):
        retval[i] = numpy.arange(Tsf*i / Tso, Tsf*i / Tso - o, -1.0)
    return numpy.sinc(retval)
This should be significantly faster than your original implementation. Try different combinations of these ideas and test their performance, see which works out the best!
I'm not quite sure what you're trying to do, but there are some speedups you can apply when creating the matrix. Braincore's suggestion to use numpy.sinc is a first step, but the second is to realize that numpy functions want to work on numpy arrays, where they can do the loops at C speed, much faster than on individual elements.
def resampleMatrix(Tso, Tsf, o, f):
    retval = numpy.sinc((Tsf*numpy.arange(f)[:, numpy.newaxis]
                         - Tso*numpy.arange(o)[numpy.newaxis, :]) / Tso)
    return retval
The trick is that by indexing the aranges with numpy.newaxis, numpy converts the array of length f to one with shape f x 1, and the array of length o to shape 1 x o. At the subtraction step, numpy "broadcasts" each input to act as an f x o shaped array and then does the subtraction. ("Broadcast" is numpy's term, reflecting the fact that no additional copy is made to stretch the f x 1 array to f x o.)
Now the numpy.sinc can iterate over all the elements in compiled code, much quicker than any for-loop you could write.
(There's an additional speed-up available if you do the division before the subtraction, especially since in the latter term the division cancels the multiplication.)
The only drawback is that you now pay for an extra Nx10*N array to hold the difference. This might be a dealbreaker if N is large and memory is an issue.
Otherwise, you should be able to write this using numpy.convolve. From what little I just learned about sinc-interpolation, I'd say you want something like numpy.convolve(orig,numpy.sinc(numpy.arange(j)),mode="same"). But I'm probably wrong about the specifics.
If your only interest is to 'generate a "smooth" plot' I would just go with a simple polynomial spline curve fit:
For any two adjacent data points the coefficients of a third-degree polynomial function can be computed from the coordinates of those data points and the two additional points to their left and right (disregarding boundary points). This will generate points on a nice smooth curve with a continuous first derivative. There's a straightforward formula for converting 4 coordinates to 4 polynomial coefficients, but I don't want to deprive you of the fun of looking it up ;o).
Here's a minimal example of 1d interpolation with scipy -- not as much fun as reinventing, but.
The plot looks like sinc, which is no coincidence:
try google spline resample "approximate sinc".
(Presumably less local / more taps ⇒ better approximation,
but I have no idea how local UnivariateSplines are.)
""" interpolate with scipy.interpolate.UnivariateSpline """
from __future__ import division
import numpy as np
from scipy.interpolate import UnivariateSpline
import pylab as pl
N = 10
H = 8
x = np.arange(N+1)
xup = np.arange( 0, N, 1/H )
y = np.zeros(N+1); y[N//2] = 100
interpolator = UnivariateSpline( x, y, k=3, s=0 ) # s=0 interpolates
yup = interpolator( xup )
np.set_printoptions( 1, threshold=100, suppress=True ) # .1f
print "yup:", yup
pl.plot( x, y, "green", xup, yup, "blue" )
pl.show()
Added feb 2010: see also basic-spline-interpolation-in-a-few-lines-of-numpy
Small improvement. Use the built-in numpy.sinc(x) function which runs in compiled C code.
Possible larger improvement: Can you do the interpolation on the fly (as the plotting occurs)? Or are you tied to a plotting library that only accepts a matrix?
I recommend that you check your algorithm, as it is a non-trivial problem. Specifically, I suggest you gain access to the article "Function Plotting Using Conic Splines" (IEEE Computer Graphics and Applications) by Hu and Pavlidis (1991). Their algorithm implementation allows for adaptive sampling of the function, such that the rendering time is smaller than with regularly spaced approaches.
The abstract follows:
A method is presented whereby, given a mathematical description of a function, a conic spline approximating the plot of the function is produced. Conic arcs were selected as the primitive curves because there are simple incremental plotting algorithms for conics already included in some device drivers, and there are simple algorithms for local approximations by conics. A split-and-merge algorithm for choosing the knots adaptively, according to shape analysis of the original function based on its first-order derivatives, is introduced.
