Python: Sampling using inverse cdf technique


I have a complicated (non standard) distribution function that I want to sample to generate simulated data points using the inverse cdf technique. For the sake of this example I will consider a Gaussian distribution
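import numpy as np
from scipy import integrate

var = 100

def f(x, a):
    def g(y):
        # Gaussian pdf with zero mean and variance var
        return (1/np.sqrt(2*np.pi*var))*np.exp(-y**2/(2*var))
    b, err = integrate.quad(g, -np.inf, x)
    return b - a  # zero when x is the a-quantile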
I want to take values of a in [0, 1), a=np.linspace(0,1,10000,endpoint=False), and use scipy.optimize.fsolve to solve for x for each a.
I have two questions:
How to use fsolve for an array of values a ?
fsolve takes an initial guess x0, how to pick a good guess value?
Thanks

Here's how you do it. I replaced 10000 with 10, as the full run is going to take a while. My initial guess is just 0, and for each subsequent value I use the previous solution as the next guess, since it should be quite close. Because the quantile increases with a, the previous solution is also a lower bound on the next one, which you can use to bound the search further.
As a side comment this kind of sampling for complicated distributions isn't really feasible, as computing the cdf can be rather difficult. There are other sampling techniques to address these issues such as Gibbs sampling, Metropolis Hastings, etc.
import numpy as np
import scipy as sp
import scipy.integrate
import scipy.optimize

var = 100

def f(x, a):
    def g(y):
        return (1/np.sqrt(2*np.pi*var))*np.exp(-y**2/(2*var))
    b, err = sp.integrate.quad(g, -np.inf, x)
    return b - a

a = np.linspace(0, 1, 10, endpoint=False)[1:]
x0 = 0
for a_ in a:
    # start slightly above the previous solution
    xi = sp.optimize.fsolve(f, x0 + 0.01, args=(a_,))[0]
    print(xi)
    x0 = xi
[EDIT] It seems to get stuck near 0, adding a small number fixes it, I'm not sure why as I don't know how fsolve works.
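As an alternative to calling fsolve once per value of a, the cdf can be tabulated once on a grid and inverted by interpolation. The following is only a sketch under assumptions that are not in the original posts: the grid limits (plus or minus 6 standard deviations), the grid size, and the names pdf, grid, cdf and inverse_cdf are illustrative, and truncating the support is only reasonable when the probability mass outside the grid is negligible.

```python
import numpy as np

var = 100.0  # variance of the example Gaussian from the question

def pdf(y):
    # Gaussian density with zero mean and variance var
    return np.exp(-y**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Assumption: the density is negligible outside [-60, 60] (6 standard deviations).
grid = np.linspace(-60.0, 60.0, 12001)
dens = pdf(grid)

# Cumulative trapezoidal rule gives the cdf at every grid point.
steps = 0.5 * (dens[1:] + dens[:-1]) * np.diff(grid)
cdf = np.concatenate(([0.0], np.cumsum(steps)))
cdf /= cdf[-1]  # renormalise to absorb the truncated tails

def inverse_cdf(a):
    # Swap the roles of x and cdf(x) and interpolate linearly.
    return np.interp(a, cdf, grid)

a = np.linspace(0, 1, 10, endpoint=False)[1:]
x = inverse_cdf(a)  # works on the whole array at once
print(x)
```

Feeding uniform random numbers to inverse_cdf instead of the regular grid a would then produce samples from the tabulated distribution.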

Related

How to prioritise some points over others using curve fit from SciPy

I want to model the following curve:
To perform it, I'm using curve_fit from SciPy, fitting an exponential function.
def exponenial_func(x, a, b, c):
    return a * b**(c*x)

popt, pcov = curve_fit(exponenial_func, x, y, p0=(1,2,2),
                       bounds=((0, 0, 0), (np.inf, np.inf, np.inf)))
When I first do it, I get this:
Which is minimising the residuals, each point with the same level of importance.
What I want, is to get a curve that gives more importance to the last values of the curve (from x-axis 30, for example) than to the first values, so it fits better in the end of the curve than in the beginning of it.
I know that from here there are many ways to approach this (first of all, define what is the importance that I want to give to each of the residuals). My question here, is to get some idea of how to approach this.
One idea that I had, is to change the sigma value to weight each data point by its inverse value.
popt, pcov = curve_fit(exponenial_func, x, y, p0=(1,2,2),
                       bounds=((0, 0, 0), (np.inf, np.inf, np.inf)),
                       sigma=1/y)
In this case, I get something like I was looking for:
It doesn't look bad, but I'm looking for another way of doing this, so that I can "control" each of the data points, like to weight each of the residuals in a linear way, or exponential, or even choosing it manually (rather than all of them by the inverse, as in the previous case).
Thanks in advance

First of all, note that there's no need for three coefficients, since
a * b**(c*x) = a * exp(log(b)*c*x),
so we can define k = log(b)*c.
Here's a suggestion for how you could tackle your problem by hand with scipy.optimize.least_squares and a priority vector:
import numpy as np
from scipy.optimize import least_squares

def exponenial_func2(x, a, k):
    return a * np.exp(k*x)

# returns the vector of residuals
def fitwrapper2(coeffs, *args):
    xdata, ydata, prio = args
    return prio*(exponenial_func2(xdata, *coeffs)-ydata)

# Data
n = 31
xdata = np.arange(n)
ydata = np.array([155.0,229,322,453,655,888,1128,1694,
                  2036,2502,3089,3858,4636,5883,7375,
                  9172,10149,12462,12462,17660,21157,
                  24747,27980,31506,35713,41035,47021,
                  53578,59138,63927,69176])

# The priority vector
prio = np.ones(n)
prio[-1] = 5

res = least_squares(fitwrapper2, x0=[1.0,2.0], bounds=(0,np.inf), args=(xdata,ydata,prio))
With prio[-1] = 5 we give the last point a high priority.
res.x contains your optimal coefficients. Here a, k = res.x.
Note that for prio = np.ones(n) it's a normal least squares fitting (like curve_fit does) where all points have the same priority.
You can control the priority of each point by increasing its value in the prio array. Comparing both results gives me:
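The same kind of weighting can also be expressed with curve_fit, which minimizes the sum of ((f(x) - y) / sigma)**2, so passing sigma = 1 / prio scales each residual by prio. Below is a minimal sketch; the synthetic noise-free data, the linear weight ramp w, and the name exp_model are assumptions made for illustration and are not the data or names from the question.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_model(x, a, k):
    # two-parameter form a * exp(k * x), as suggested above
    return a * np.exp(k * x)

# Assumption: synthetic, noise-free data with known a = 150 and k = 0.2
x = np.arange(31, dtype=float)
y = exp_model(x, 150.0, 0.2)

# Hypothetical weights growing linearly from 1 (first point) to 5 (last point)
w = np.linspace(1.0, 5.0, x.size)

# A smaller sigma means a larger weight, hence sigma = 1 / w
popt, pcov = curve_fit(exp_model, x, y, p0=(100.0, 0.15), sigma=1.0 / w)
print(popt)
```

Any other weight profile (exponential, hand-picked values) can be substituted for w in the same way.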

Fitting parameter inside an integral using python (or another useful language)

I have a set of data giving f(x) as a function of x. From the theory of the problem I'm working on, I know the form of f(x), which is given by the expression below:
Essentially, I want to use this data set to find the parameters a and b. My problem is: how can I do that? What library should I use? I would like an answer using Python, but R or Julia would be OK as well.
So far I've read about curve_fit from the SciPy library, but I'm having trouble working out how to write the code, since my x variable is one of the integration limits.
To make the problem easier to work with, I also have the following resources:
A sample set for which I know the parameters I'm looking for: a = 2 and b = 1 (and c = 3). In case it raises questions about how I know these parameters: I created this sample set myself by integrating the equation above with these values, so that I have a reference while investigating how to recover them.
I also have this set, for which the only information I have is that c = 4; I want to find a and b.
I would also like to point out that:
i) Right now I have no code to post here because I don't have a clue how to write something to solve my problem. I would be happy to edit and update the question after reading any answer or help you can provide.
ii) I'm looking first for a solution where I don't know a and b. If that is too hard, I would be happy to see a solution where either a or b is assumed known.
EDIT 1: I would like to reference this question to anyone interested in this problem as it's a parallel but also important discussion to the problem faced here

I would use a purely numerical approach, which you can use even when you cannot solve the integral analytically. Here's a snippet for fitting only the a parameter:
import numpy as np
from scipy.optimize import curve_fit
import pandas as pd
import matplotlib.pyplot as plt

def integrand(x, a):
    b = 1
    c = 3
    return 1/(a*np.sqrt(b*(1+x)**3 + c*(1+x)**4))

def integral(x, a):
    # trapezoidal rule on [0, x) with a fixed step dx
    dx = 0.001
    xx = np.arange(0, x, dx)
    arr = integrand(xx, a)
    return np.trapz(arr, dx=dx, axis=-1)  # np.trapezoid in NumPy >= 2.0

vec_integral = np.vectorize(integral)

df = pd.read_csv('data-with-known-coef-a2-b1-c3.csv')
x = df.domin.values
y = df.resultados2.values

out_mean, out_var = curve_fit(vec_integral, x, y, p0=[2])
plt.plot(x, y)
plt.plot(x, vec_integral(x, out_mean[0]))
plt.title(f'a = {out_mean[0]:.3f} +- {np.sqrt(out_var[0][0]):.3f}')
plt.show()
Of course, you can lower the value of dx to get the desired precision. Fitting just a works well, but when you try to fit b as well, the fit does not converge properly (in my opinion because a and b are strongly correlated). Here's what you get:
def integrand(x, a, b):
    c = 3
    return 1/(a*np.sqrt(np.abs(b*(1+x)**3 + c*(1+x)**4)))

def integral(x, a, b):
    dx = 0.001
    xx = np.arange(0, x, dx)
    arr = integrand(xx, a, b)
    return np.trapz(arr, dx=dx, axis=-1)

vec_integral = np.vectorize(integral)

out_mean, out_var = curve_fit(vec_integral, x, y, p0=[2,3])
plt.title(f'a = {out_mean[0]:.3f} +- {np.sqrt(out_var[0][0]):.3f}\nb = {out_mean[1]:.3f} +- {np.sqrt(out_var[1][1]):.3f}')
plt.plot(x, y, alpha=0.4)
plt.plot(x, vec_integral(x, out_mean[0], out_mean[1]), color='green', label='fitted solution')
plt.plot(x, vec_integral(x, 2, 1),'--', color='red', label='theoretical solution')
plt.legend()
plt.show()
As you can see, even if the a and b parameters resulting from the fit are "not good", the plot is very similar.
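For comparison, the fixed-step trapezoidal rule can be swapped for scipy.integrate.quad, which chooses its own evaluation points. This is a sketch under assumptions: the CSV file is replaced by synthetic data generated with a = 2 (with b = 1 and c = 3 held fixed, as in the first snippet), and the names B, C and model are illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import curve_fit

B, C = 1.0, 3.0  # held fixed, as in the single-parameter snippet above

def integrand(t, a):
    return 1.0 / (a * np.sqrt(B * (1 + t)**3 + C * (1 + t)**4))

def model(x, a):
    # x is the upper integration limit; integrate once per data point
    return np.array([quad(integrand, 0.0, xi, args=(a,))[0]
                     for xi in np.atleast_1d(x)])

# Assumption: synthetic stand-in for the data file, generated with a = 2
x = np.linspace(0.1, 5.0, 25)
y = model(x, 2.0)

popt, pcov = curve_fit(model, x, y, p0=[1.0])
print(popt)
```

The model is still evaluated point by point, so this is not faster than the vectorized trapezoid, but it removes the need to tune dx by hand.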

There are three parameters, a, b and c, which are not independent. One of them must be given if we want to compute the other two by regression. With c given, solving for a and b is simple:
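Why the parameters are not independent can be made explicit. Assuming the integrand used in the code of the previous answer (the original formula image is not reproduced here) and a > 0, the factor a can be moved inside the square root:

```latex
f(x) = \int_0^x \frac{dt}{a\sqrt{b\,(1+t)^3 + c\,(1+t)^4}}
     = \int_0^x \frac{dt}{\sqrt{a^2 b\,(1+t)^3 + a^2 c\,(1+t)^4}}
```

Only the products a^2 b and a^2 c enter the expression, so the data can determine two quantities at most, and a third piece of information, such as the value of c, is needed to separate a, b and c.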
The example of numerical calculus below is made with a small data (n=10) in order to make it easy to check.
Note that the regression is for the function t(y), which is not exactly the same as for y(x) when the data is scattered (the result is the same if there is no scatter).
If it is absolutely necessary to have the regression for y(x), a non-linear regression is necessary. This involves an iterative process starting from a good enough initial guess for a and b. The above calculation gives very good initial values.
IN ADDITION :
Meanwhile Andrea posted a pertinent answer. Of course the fit with his method is better, because it is a non-linear regression instead of a linear one, as already pointed out in the note above.
Nevertheless, despite the different values (a=1.881 ; b=1.617) compared to (a=2.346 , b=-0.361), the respective curves drawn below are not far from each other:
Blue curve : from linear regression (above method)
Green curve : from non-linear regression ( Andrea's )
CASE OF THE SECOND SET OF DATA
https://mega.nz/#!echEjQyK!tUEx0gpFND7gucvsTONiB_wn-ewBq-5k-pZlfLxmfvw
The regression fails because the assumption c=3 is false.
In the case c=0 the analytic calculus of the integral is different from above :

Finding zeros of equation using python

I'm trying to write code that will find n, in this equation.
with the rest as user defined variables.
from scipy.optimize import fsolve
from scipy.stats import t

def f(alpha, beta, sigma, delta, eps):
    n = ((t.ppf(1-alpha,2*n-2) + t.ppf((1-beta)/2,2*n-2))**2*sigma**2)/(2* (delta-abs(eps))**2)

I'd also like to be able to set up different scenarios of parameters and then have it output a table of the parameters and the results (e.g., input alpha1, alpha2, beta1, beta2 etc. and get out [alpha1, beta1,..., n],[alpha1, beta2,...,n]). I'm not quite sure what the best way to do that would be, if anyone can generally point me in the right direction.

By the looks of your equation, you are trying to find the number of observations (n) that satisfies the statistical test equation. If that is the case, then n is a natural number (0, 1, 2, etc.) and is easily iterable.
You could set up a solver yourself, where you have n as the iterable and the equation with result as the "result" of your equation:
for n in range(0, 1000):
    result = your_function(n, other_parameters)
Then, inside the loop, you simply need to check if the equation is satisfied:
    if n >= result:
        print("result:", n)
        break  # This will exit the loop
As for testing different user-given parameters, you can set up another loop that iterates over different values of alpha, beta and so on.
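Put together, the loop and the scenario table might look like the sketch below. The right-hand side is copied from the expression typed in the question's code (the equation image itself is not available, so its exact form is an assumption), and the function names rhs and smallest_n as well as the parameter values are hypothetical.

```python
from itertools import product
from scipy.stats import t

def rhs(n, alpha, beta, sigma, delta, eps):
    # Right-hand side of the equation, as typed in the question, for a trial n
    df = 2 * n - 2
    q = t.ppf(1 - alpha, df) + t.ppf((1 - beta) / 2, df)
    return q**2 * sigma**2 / (2 * (delta - abs(eps))**2)

def smallest_n(alpha, beta, sigma, delta, eps, n_max=100000):
    # n starts at 2 so that the degrees of freedom 2n - 2 are positive
    for n in range(2, n_max):
        if n >= rhs(n, alpha, beta, sigma, delta, eps):
            return n
    return None  # no solution below n_max

# Hypothetical scenarios: every combination of the listed alphas and betas
alphas = [0.05, 0.01]
betas = [0.1, 0.2]
table = [(a, b, smallest_n(a, b, sigma=1.0, delta=0.5, eps=0.1))
         for a, b in product(alphas, betas)]

for row in table:
    print(row)
```

itertools.product enumerates the scenarios, so adding more parameters to the table only means adding more lists to the product call.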

How to find all zeros of a function using numpy (and scipy)?

Suppose I have a function f(x) defined between a and b. This function can have many zeros, but also many asymptotes. I need to retrieve all the zeros of this function. What is the best way to do it?
Actually, my strategy is the following:
I evaluate my function on a given number of points
I detect whether there is a change of sign
I find the zero between the points that are changing sign
I verify if the zero found is really a zero, or if this is an asymptote
U = numpy.linspace(a, b, 100)  # evaluate function at 100 different points
c = f(U)
s = numpy.sign(c)
for i in range(100-1):
    if s[i] + s[i+1] == 0:  # opposite signs
        u = scipy.optimize.brentq(f, U[i], U[i+1])
        z = f(u)
        if numpy.isnan(z) or abs(z) > 1e-3:
            continue
        print('found zero at {}'.format(u))
This algorithm seems to work, except I see two potential problems:
It will not detect a zero that doesn't cross the x axis (for example, in a function like f(x) = x**2). However, I don't think it can occur with the function I'm evaluating.
If the discretization points are too far apart, there could be more than one zero between them, and the algorithm could fail to find them.
Do you have a better strategy (still efficient) to find all the zeros of a function?
I don't think it's important for the question, but for those who are curious, I'm dealing with characteristic equations of wave propagation in optical fiber. The function looks like this (where V and ell are previously defined, and ell is a positive integer):
def f(u):
    w = numpy.sqrt(V**2 - u**2)
    jl = scipy.special.jn(ell, u)
    # jnjn and jnkn are reproduced as posted in the question
    jl1 = scipy.special.jnjn(ell-1, u)
    kl = scipy.special.jnkn(ell, w)
    kl1 = scipy.special.jnkn(ell-1, w)
    return jl / (u*jl1) + kl / (w*kl1)

Why are you limited to numpy? Scipy has a package that does exactly what you want:
http://docs.scipy.org/doc/scipy/reference/optimize.nonlin.html
One lesson I've learned: numerical programming is hard, so don't do it :)
Anyway, if you're dead set on building the algorithm yourself, the doc page on scipy I linked (takes forever to load, btw) gives you a list of algorithms to start with. One method that I've used before is to discretize the function to the degree that is necessary for your problem. (That is, tune \delta x so that it is much smaller than the characteristic size in your problem.) This lets you look for features of the function (like changes in sign). AND, you can compute the derivative of a line segment (probably since kindergarten) pretty easily, so your discretized function has a well-defined first derivative. Because you've tuned the dx to be smaller than the characteristic size, you're guaranteed not to miss any features of the function that are important for your problem.
If you want to know what "characteristic size" means, look for some parameter of your function with units of length or 1/length. That is, for some function f(x), assume x has units of length and f has no units. Then look for the things that multiply x. For example, if you want to discretize cos(\pi x), the parameter that multiplies x (if x has units of length) must have units of 1/length. So the characteristic size of cos(\pi x) is 1/\pi. If you make your discretization much smaller than this, you won't have any issues. To be sure, this trick won't always work, so you may need to do some tinkering.
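The cos(\pi x) example can be sketched as follows. The interval [0, 3], the factor of 20 between the characteristic size and the step, and the variable names are assumptions chosen for illustration; brentq is used to refine each sign change.

```python
import numpy as np
from scipy.optimize import brentq

def f(x):
    return np.cos(np.pi * x)

char_size = 1.0 / np.pi   # characteristic size of cos(pi x)
dx = char_size / 20.0     # assumption: 20 times smaller is "much smaller"

grid = np.arange(0.0, 3.0 + dx, dx)
vals = f(grid)

roots = []
for lo, hi, flo, fhi in zip(grid[:-1], grid[1:], vals[:-1], vals[1:]):
    if flo * fhi < 0:  # sign change between neighbouring grid points
        roots.append(brentq(f, lo, hi))

print(roots)
```

With a step this small relative to the characteristic size, each grid cell contains at most one sign change of this function, so every crossing in the interval is picked up.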

I found out it's relatively easy to implement your own root finder using scipy.optimize.fsolve.
Idea: Find any zeroes in the interval (start, stop) with step size step by calling fsolve repeatedly with a changing x0. Use a relatively small step size to find all the roots.
It can only search for zeroes in one dimension (other dimensions must be fixed). If you have other needs, I would recommend using sympy for calculating the analytical solution.
Note: It may not always find all the zeroes, but I saw it giving relatively good results. I also put the code in a gist, which I will update if needed.
import numpy as np
import scipy
from scipy.optimize import fsolve
from matplotlib import pyplot as plt
# Defined below
r = RootFinder(1, 20, 0.01)
args = (90, 5)
roots = r.find(f, *args)
print("Roots: ", roots)
# plot results
u = np.linspace(1, 20, num=600)
fig, ax = plt.subplots()
ax.plot(u, f(u, *args))
ax.scatter(roots, f(np.array(roots), *args), color="r", s=10)
ax.grid(color="grey", ls="--", lw=0.5)
plt.show()
Example output:
Roots: [ 2.84599497 8.82720551 12.38857782 15.74736542 19.02545276]
RootFinder definition
import numpy as np
import scipy
from scipy.optimize import fsolve
from matplotlib import pyplot as plt
class RootFinder:
    def __init__(self, start, stop, step=0.01, root_dtype="float64", xtol=1e-9):
        self.start = start
        self.stop = stop
        self.step = step
        self.xtol = xtol
        self.roots = np.array([], dtype=root_dtype)

    def add_to_roots(self, x):
        if (x < self.start) or (x > self.stop):
            return  # outside range
        if any(abs(self.roots - x) < self.xtol):
            return  # root already found.
        self.roots = np.append(self.roots, x)

    def find(self, f, *args):
        current = self.start
        for x0 in np.arange(self.start, self.stop + self.step, self.step):
            if x0 < current:
                continue
            x = self.find_root(f, x0, *args)
            if x is None:  # no root found.
                continue
            current = x
            self.add_to_roots(x)
        return self.roots

    def find_root(self, f, x0, *args):
        x, _, ier, _ = fsolve(f, x0=x0, args=args, full_output=True, xtol=self.xtol)
        if ier == 1:
            return x[0]
        return None
Test function
The scipy.special.jnjn does not exist anymore, but I created a similar test function for the case.
def f(u, V=90, ell=5):
    w = np.sqrt(V ** 2 - u ** 2)
    jl = scipy.special.jn(ell, u)
    jl1 = scipy.special.yn(ell - 1, u)
    kl = scipy.special.kn(ell, w)
    kl1 = scipy.special.kn(ell - 1, w)
    return jl / (u * jl1) + kl / (w * kl1)

The main problem I see with this is whether you can actually find all the roots. As has already been mentioned in the comments, this is not always possible. If you are sure that your function is not completely pathological (sin(1/x) was already mentioned), the next question is what your tolerance is for missing one or several roots. Put differently, it's about what lengths you are prepared to go to in order to make sure you did not miss any. To the best of my knowledge, there is no general method to isolate all the roots for you, so you'll have to do it yourself. What you show is already a reasonable first step. A couple of comments:
Brent's method is indeed a good choice here.
First of all, deal with the divergences. Since your function has Bessel functions in the denominators, you can first solve for their roots, or better, look them up in e.g. Abramowitz and Stegun (Mathworld link). This will be better than using the ad hoc grid you're using.
Once you've found two roots or divergences, x_1 and x_2, you can run the search again in the interval [x_1+epsilon, x_2-epsilon]. Continue until no more roots are found (Brent's method is guaranteed to converge to a root, provided there is one).
If you cannot enumerate all the divergences, you might want to be a little more careful in verifying that a candidate is indeed a divergence: given x, don't just check that f(x) is large; check that, e.g., |f(x-epsilon/2)| > |f(x-epsilon)| for several values of epsilon (1e-8, 1e-9, 1e-10, something like that).
If you want to make sure you don't have roots which simply touch zero, look for the extrema of the function, and for each extremum, x_e, check the value of f(x_e).
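The divergence check from the comments above can be sketched with tan(x), which has both roots and poles. This is an illustration under assumptions: the test function, the grid, the epsilon values (chosen larger than brentq's default tolerance) and the helper name is_divergence are not from the original answer.

```python
import numpy as np
from scipy.optimize import brentq

def is_divergence(f, x, eps_values=(1e-6, 1e-7, 1e-8)):
    # Near a pole |f| grows as x is approached; near a root it shrinks.
    return all(abs(f(x - eps / 2)) > abs(f(x - eps)) for eps in eps_values)

f = np.tan  # roots at k*pi, poles at pi/2 + k*pi
grid = np.linspace(0.1, 7.0, 200)
vals = f(grid)

roots, poles = [], []
for lo, hi, flo, fhi in zip(grid[:-1], grid[1:], vals[:-1], vals[1:]):
    if flo * fhi < 0:
        # brentq narrows any sign change, whether it is a root or a pole
        x = brentq(f, lo, hi)
        if is_divergence(f, x):
            poles.append(x)
        else:
            roots.append(x)

print("roots:", roots)
print("poles:", poles)
```

Compared with a fixed threshold on |f(x)|, this test does not depend on how large the function happens to be at the returned point.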

I've also encountered this problem when solving equations like f(z)=0 where f was a holomorphic function. I wanted to be sure not to miss any zero, and finally developed an algorithm based on the argument principle.
It helps to find the exact number of zeros lying in a complex domain. Once you know the number of zeros, it is easier to find them. There are, however, two concerns that must be taken into account:
Take care with multiplicity: when solving (z-1)^2 = 0, you'll get two zeros, as z=1 is counted twice.
If the function is meromorphic (and thus contains poles), each pole reduces the count of zeros and breaks the attempt to count them.
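A minimal sketch of the counting step, not the poster's actual algorithm: the number of zeros inside a circle is (1 / 2 pi i) times the contour integral of f'(z)/f(z), evaluated here with an equally spaced rule on the circle. It assumes that f is holomorphic inside the contour, that its derivative is available, and that no zero lies on the contour; the function count_zeros and the example polynomial are hypothetical.

```python
import numpy as np

def count_zeros(f, fprime, center=0.0, radius=1.0, m=512):
    # Points on the circle z = center + radius * exp(i*theta)
    theta = 2 * np.pi * np.arange(m) / m
    z = center + radius * np.exp(1j * theta)
    # With dz = i (z - center) dtheta, the contour integral divided by
    # 2*pi*i becomes the mean of (z - center) * f'(z) / f(z) over theta.
    return np.mean((z - center) * fprime(z) / f(z))

# Example: a double zero at z = 1 and a simple zero at z = -0.5
f = lambda z: (z - 1)**2 * (z + 0.5)
fprime = lambda z: 2 * (z - 1) * (z + 0.5) + (z - 1)**2

n = count_zeros(f, fprime, radius=2.0)
print(n)  # the real part is close to an integer, the imaginary part to zero
```

The double zero contributes twice to the count, which is the multiplicity caveat mentioned above.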

scipy optimize - fmin Nelder-Mead simplex

I'm trying to use the scipy Nelder-Mead simplex search function to find a minimum to a non-linear function. It appears my simplex gets stuck because it starts off with an initial simplex that is too small. Unfortunately, I don't see anywhere in scipy where you can change some of the simplex parameters (e.g. initial simplex size). Is there a way? Am I missing something? Or are there other implementations of the NM simplex?
Thanks

Two suggestions for Nelder-Mead:
1) snap all x to a grid, say .01, inside the function:
x = np.round( x / grid ) * grid
f = ...
This acts as a simple noise filter in high dimensions
(in 2d or 3d, don't bother).
2) start off with the best d+1 of 2d+1 nearby points,
instead of the usual d+1:
def neard1( func, x, h, verbose=1 ):
    """ eval func at 2d+1 points x, x +- h
        sort
        -> f[ d+1 best values ], X[ d+1 ]
        to start or restart Nelder-Mead
    """
    dim = len(x)
    I = np.eye(dim)
    np.fill_diagonal( I, h )  # scalar or vec
    X = x + np.vstack(( np.zeros(dim), I, - I ))
    fnear = np.array([ func( x ) for x in X ])  # 2d+1
    f0 = fnear[0]
    up = np.argsort( fnear )  # vec func: |fnear|
    if verbose:
        print( "neard1: f %g +- %s around x %s" % (
            f0, fnear[up] - f0, x ))
    bestd1 = up[:dim+1]
    return fnear[bestd1], X[bestd1]
It's also not a bad idea to look at the neard1() values after Nelder-Mead,
to get an idea of what func() looks like there.
If any neighbors are better than the N-M "best", restart N-M from that new simplex.
(One can alternate neard1, N-M, neard1, N-M: easy but very problem-dependent.)
How many variables do you have, and how noisy is your function ?
Hope this helps
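In more recent SciPy versions, a starting simplex such as the d+1 points returned by neard1 can be handed to the solver through the initial_simplex option of scipy.optimize.minimize with method="Nelder-Mead". The sketch below uses a simple quadratic test function and a step h = 1.0, both of which are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def func(x):
    # hypothetical test function with its minimum at (1, -2)
    return (x[0] - 1.0)**2 + 10.0 * (x[1] + 2.0)**2

x0 = np.array([0.0, 0.0])
h = 1.0  # assumed edge length of the starting simplex

# d+1 vertices: x0 itself plus one step of size h along each axis
simplex = np.vstack([x0, x0 + h * np.eye(len(x0))])

res = minimize(func, x0, method="Nelder-Mead",
               options={"initial_simplex": simplex,
                        "xatol": 1e-8, "fatol": 1e-8, "maxiter": 2000})
print(res.x)
```

When initial_simplex is given, it determines the starting vertices instead of the default simplex built around x0, so h directly controls the initial simplex size asked about in the question.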

From the reference at http://docs.scipy.org/doc/:
Method Nelder-Mead uses the Simplex algorithm [R123], [R124]. This algorithm has been successful in many applications but other algorithms using the first and/or second derivatives information might be preferred for their better performances and robustness in general.
It may be recommended to use a completely different algorithm, then. Note that:
Method BFGS uses the quasi-Newton method of Broyden, Fletcher, Goldfarb, and Shanno (BFGS) [R127] pp. 136. It uses the first derivatives only. BFGS has proven good performance even for non-smooth optimizations. This method also returns an approximation of the Hessian inverse, stored as hess_inv in the OptimizeResult object.
BFGS sounds more robust and faster overall.
ParagonRG
