I am trying to fit a function to my data using scipy.optimize.curve_fit.
Q=optimization.curve_fit(func,X,Y, x0,ERR)
and it works well.
However, now I am trying to use an asymmetric error and I have no idea how to do that - or even if it is possible.
By asymmetric error I mean that the error is not, for example, 3 ± 0.5 but rather 3 +0.6 / -0.2.
So that ERR is an array with two columns.
It would be great if somebody had an idea how to do that - or could point me to a different Python routine which might be able to do it.
Here is a snippet of the code I am using - but I am not sure it makes things clearer:
import numpy
from scipy import optimize as optimization
from scipy.special import erf

A = numpy.genfromtxt('WF.dat')
cc = A[:, 4]

def func(A, a1, b1, c1):
    N = numpy.zeros(len(A))
    for i in range(len(A)):
        N[i] = 1.0 * erf(a1 * (A[i, 1] - c1 * A[i, 0] ** b1))
    return N

x0 = numpy.array([2.5, -0.07, -5.0])
Q = optimization.curve_fit(func, A, cc, x0, Error)
And Error=[ErP,ErM] (2 columns)
A least squares algorithm like curve_fit or scipy.optimize.leastsq will not be able to do this, because the loss function is quadratic and therefore symmetric for positive and negative errors.
I haven't seen any models for this; maybe PAIDA can handle it, as DanHickstein mentioned.
Otherwise, you could use the nonlinear optimizers like optimize.fmin and construct your own asymmetric loss function.
def loss_function(params, x, y, sigma_low, sigma_upp):
    error = y - func(x, *params)
    error_neg = (error < 0)  # True where the model overestimates the data
    # weight negative residuals by sigma_low, positive ones by sigma_upp
    error_squared = error**2 / (error_neg * sigma_low + (1 - error_neg) * sigma_upp)
    return error_squared.sum()
and minimize this with fmin or fmin_bfgs.
(I never tried this.)
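As a hedged sketch of how that could be wired up (the data array names and the starting guess are placeholders, not tested):
import numpy as np
from scipy.optimize import fmin

# x, y are the data arrays; sigma_low, sigma_upp are the lower/upper errors per point.
# The starting values are just the ones from the question.
best_params = fmin(loss_function, x0=[2.5, -0.07, -5.0],
                   args=(x, y, sigma_low, sigma_upp))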
In the current version, I am afraid it is not doable. curve_fit is a wrapper around the popular Fortran library MINPACK. Check the source code of \scipy_install_path\optimize\minpack.py and you will see (lines 498-509):
if sigma is None:
    func = _general_function
else:
    func = _weighted_general_function
    args += (1.0/asarray(sigma),)
Basically, what it means is that if sigma is not provided, then the unweighted Levenberg-Marquardt method in MINPACK will be called; if sigma is provided, then the weighted LM will be called. That implies that if sigma is to be provided, it must be provided as an array of the same length as X or Y.
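For reference, this is the symmetric weighting that curve_fit itself supports (a sketch using the question's names, with sym_err as a hypothetical 1-D error array):
import numpy as np
from scipy.optimize import curve_fit

# sym_err: symmetric per-point errors, same length as Y
popt, pcov = curve_fit(func, X, Y, p0=x0, sigma=sym_err, absolute_sigma=True)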
That means if you want to have asymmetric error residue on Y, you have to come up with some modification to your target function, as #Jaime suggested.
I'm not 100% sure, but it looks like the PAIDA package might do fits with asymmetric errors:
http://paida.sourceforge.net/documentation/fitter/index.html
A solution, which I've used frequently, is to draw realisations (say 100-1000) from a split-normal distribution, and run your fitting algorithm on each of these realisations with the error set to 0.0. You'll then have 100-1000 best-fitting parameters, from which you can simply take the median, along with any error estimate you want to use.
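A rough sketch of that recipe, assuming a model function model(x, *params), data arrays x and y, asymmetric errors err_lo/err_hi, and a starting guess p0 (all hypothetical names, not from the question):
import numpy as np
from scipy.optimize import curve_fit

def split_normal_realisation(y, err_lo, err_hi, rng):
    # Perturb each point up or down: the upper side is chosen with probability
    # proportional to err_hi, and the perturbation width matches the chosen side.
    upper = rng.random(y.shape) < err_hi / (err_lo + err_hi)
    width = np.where(upper, err_hi, err_lo)
    return y + np.where(upper, 1.0, -1.0) * np.abs(rng.normal(0.0, width))

rng = np.random.default_rng(0)
fits = []
for _ in range(500):                              # 100-1000 realisations
    y_i = split_normal_realisation(y, err_lo, err_hi, rng)
    popt, _ = curve_fit(model, x, y_i, p0=p0)     # error set to 0.0, i.e. no sigma passed
    fits.append(popt)

fits = np.array(fits)
best = np.median(fits, axis=0)                    # central estimate
lo, hi = np.percentile(fits, [16, 84], axis=0)    # asymmetric error band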
I would like to implement a function in Python (using NumPy) that takes a mathematical function (for example p(x) = e^(-x), as below) as input and generates random numbers that are distributed according to that function's probability distribution. And I need to plot them, so we can see the distribution.
Actually, I need a random number generator for exactly the following two mathematical functions as input, but if it could take other functions, why not:
1) p(x) = e^(-x)
2) g(x) = (1/sqrt(2*pi)) * e^(-(x^2)/2)
Does anyone have any idea how this is doable in python?
For simple distributions like the ones you need, or if you have a CDF that is easy to invert in closed form, you can find plenty of samplers in NumPy, as correctly pointed out in Olivier's answer.
For arbitrary distributions you could use Markov chain Monte Carlo (MCMC) sampling methods.
The simplest, and maybe easiest to understand, variant of these algorithms is Metropolis sampling.
The basic idea goes like this:
start from a random point x and take a random step xnew = x + delta
evaluate the desired probability distribution at the starting point, p(x), and at the new one, p(xnew)
if the new point is more probable, p(xnew)/p(x) >= 1, accept the move
if the new point is less probable, randomly decide whether to accept or reject it depending on how probable [1] the new point is
take a new step from this point and repeat the cycle
It can be shown, see e.g. Sokal [2], that points sampled with this method follow the desired probability distribution.
An extensive implementation of Monte Carlo methods in Python can be found in the PyMC3 package.
Example implementation
Here's a toy example just to show you the basic idea, not meant in any way as a reference implementation. Please refer to mature packages for any serious work.
import numpy as np

def uniform_proposal(x, delta=2.0):
    return np.random.uniform(x - delta, x + delta)

def metropolis_sampler(p, nsamples, proposal=uniform_proposal):
    x = 1  # start somewhere
    for i in range(nsamples):
        trial = proposal(x)  # random neighbour from the proposal distribution
        acceptance = p(trial)/p(x)
        # accept the move conditionally
        if np.random.uniform() < acceptance:
            x = trial
        yield x
Let's see if it works with some simple distributions
Gaussian mixture
def gaussian(x, mu, sigma):
    return 1./sigma/np.sqrt(2*np.pi)*np.exp(-((x - mu)**2)/2./sigma/sigma)

p = lambda x: gaussian(x, 1, 0.3) + gaussian(x, -1, 0.1) + gaussian(x, 3, 0.2)
samples = list(metropolis_sampler(p, 100000))
Cauchy
def cauchy(x, mu, gamma):
    return 1./(np.pi*gamma*(1. + ((x - mu)/gamma)**2))
p = lambda x: cauchy(x, -2, 0.5)
samples = list(metropolis_sampler(p, 100000))
Arbitrary functions
You don't really have to sample from proper probability distributions. You might just have to enforce a limited domain in which to sample your random steps [3].
p = lambda x: np.sqrt(x)
samples = list(metropolis_sampler(p, 100000, domain=(0, 10)))
p = lambda x: (np.sin(x)/x)**2
samples = list(metropolis_sampler(p, 100000, domain=(-4*np.pi, 4*np.pi)))
Conclusions
There is still way too much to say, about proposal distributions, convergence, correlation, efficiency, applications, Bayesian formalism, other MCMC samplers, etc.
I don't think this is the proper place and there is plenty of much better material than what I could write here available online.
[1] The idea here is to favor exploration where the probability is higher, but still look at low-probability regions, as they might lead to other peaks. Fundamental is the choice of the proposal distribution, i.e. how you pick new points to explore. Steps that are too small might constrain you to a limited area of your distribution; steps that are too big could lead to a very inefficient exploration.
[2] Physics oriented. The Bayesian formalism (Metropolis-Hastings) is preferred these days, but IMHO it's a little harder to grasp for beginners. There are plenty of tutorials available online, see e.g. this one from Duke University.
[3] Implementation not shown so as not to add too much confusion; it's straightforward, you just have to wrap trial steps at the domain edges or make the desired function go to zero outside the domain.
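A minimal, untested sketch of one way to add that domain argument to the sampler above (the function and parameter names here are mine, not from the answer):
def metropolis_sampler_domain(p, nsamples, domain, proposal=uniform_proposal):
    lo, hi = domain
    # Treat the target as zero outside the domain, so such trials are always rejected.
    p_dom = lambda x: p(x) if lo <= x <= hi else 0.0
    x = 0.5 * (lo + hi)  # start somewhere inside the domain
    for _ in range(nsamples):
        trial = proposal(x)
        # Equivalent to uniform() < p_dom(trial)/p_dom(x), but avoids dividing by zero.
        if np.random.uniform() * p_dom(x) < p_dom(trial):
            x = trial
        yield x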
NumPy offers a wide range of probability distributions.
The first function is an exponential distribution with parameter 1.
np.random.exponential(1)
The second one is a normal distribution with mean 0 and variance 1.
np.random.normal(0, 1)
Note that in both cases the arguments are optional, as these are the default values for these distributions.
As a sidenote, you can also find those distributions in the random module as random.expovariate and random.gauss respectively.
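For instance:
import random

random.expovariate(1.0)  # exponential with rate (lambda) 1
random.gauss(0.0, 1.0)   # normal with mean 0 and standard deviation 1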
More general distributions
While NumPy will likely cover all your needs, remember that you can always compute the inverse cumulative distribution function of your distribution and apply it to values drawn from a uniform distribution.
inverse_cdf(np.random.uniform())
For example, if NumPy did not provide the exponential distribution, you could do this:
def exponential():
    # inverse CDF of Exp(1): F^-1(u) = -ln(1 - u)
    return -np.log(1.0 - np.random.uniform())
If you encounter distributions whose CDF is not easy to compute, then consider filippo's great answer.
Recently I wanted to demonstrate generating a continuous random variable using the universality of the uniform. For that, I wanted to use the combination of NumPy and matplotlib. However, the generated random variable seems a little bit off to me, and I don't know whether it is caused by the way NumPy's random uniform and vectorize work or if I am doing something fundamentally wrong here.
Let U ~ Unif(0, 1) and X = F^-1(U). Then X is a random variable with CDF F (please note that F^-1 here denotes the quantile function; I also omit the second part of the universality because it will not be necessary).
Let's assume that the CDF of interest to me is the (standard) logistic CDF, F(x) = 1 / (1 + e^(-x)); its quantile function is then F^-1(u) = ln(u / (1 - u)).
According to the universality of the uniform, to generate a random variable with this CDF, it is enough to plug U ~ Unif(0, 1) into F^-1. Therefore, I've written a very simple code snippet for that:
import numpy as np

U = np.random.uniform(0, 1, 1000000)

def logistic(u):
    x = np.log(u / (1 - u))
    return x

logistic_transform = np.vectorize(logistic)
X = logistic_transform(U)
However, the result seems a little bit off to me. Although the histogram of the generated random variable X resembles a logistic distribution (whose simplified CDF I've used), the r.v. seems to be distributed in a very unequal way, and I can't wrap my head around exactly why that is so. I would be grateful for any suggestions. Below are the histograms of U and X.
You have a large sample size, so you can increase the number of bins in your histogram and still get a good number of samples per bin. If you are using matplotlib's hist function, try (for example) bins=400. I get this plot, which has the symmetry that I think you expected:
Also (and this is not relevant to the question): your function logistic will handle a NumPy array without wrapping it with vectorize, so you can save a few CPU cycles by writing X = logistic(U). And you can save a few lines of code by using scipy.special.logit instead of implementing it yourself.
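For instance, a small sketch reusing the question's setup:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import logit

U = np.random.uniform(0, 1, 1000000)
X = logit(U)              # same as np.log(U / (1 - U)), no np.vectorize needed

plt.hist(X, bins=400)     # more bins makes the expected symmetry visible
plt.show()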
I am currently working with fitting decline curves to real-world production data. I have had good luck with creating a hyperbolic and using curve_fit from scipy.optimize. The current function I use:
def hyp_func(x, qi, b, di):
    return qi*(1.0 - b*di*x)**(-1.0/b)
What I would like to do now is, at a certain rate of decline, transition to an exponential function. How would I go about this and still be able to use curve_fit? I am trying the code below; is this the way to do it, or is there a better way?
def hyp_func2(x, qi, b, di):
    dlim = -0.003
    hy = qi*(1.0 - b*di*x)**(-1.0/b)
    hdy = di/(1.0 - b*di*x)
    ex = x[hdy > dlim]
    qlim = qi*(dlim/di)**(1/b)
    xlim = ((qi/qlim)**b - 1)/(b*-di)
    ey = qlim*np.exp(dlim*(ex - xlim))
    y = np.concatenate((hy[hdy < dlim], ey))
    return y
hy is the hyperbolic equation
hdy is the hy derivative
ex is the part of x after derivative hits dlim
ey is the exponential equation
I am still working out the equations; I am not getting a continuous function.
edit: data here, and updated equations
Sorry to be the bearer of bad news, but if I understand what you are trying to do, I think it is very difficult to have scipy.optimize.curve_fit, or any of the other methods from scipy.optimize do what you are hoping to do.
Most fitting algorithms are designed to work with continuous variables, and usually (and curve_fit for sure) start off by making very small changes in parameter values to find the right direction and step size to take to improve the result.
But what you're looking for is a discrete variable as the breakpoint between one functional form (roughly, "power law") and another ("exponential"). The algorithm won't normally make a large enough change in your di parameter to make a difference for which value is used as that breakpoint, and it may decide that di does not affect the fit (your model uses di in other ways too, so you might get lucky and di might still have an effect on the fit).
Assuming that qi > 0, the slope is actually positive, so I do not get the choice of -0.003. Moreover, I think the derivative is wrong.
You can calculate exactly the value where the slope reaches a critical value.
Now, from my experience you have two choices. If you define a piecewise function yourself, you usually run into trouble with function calls using numpy arrays. I typically use scipy.optimize.leastsq with a self-defined residual function. A second option is a continuous transition between the two functions. You can make that as sharp as you want, as value and slope already fit, by definition.
The two solutions look as follows
import matplotlib.pyplot as plt
import numpy as np

def hy(x, b, qi, di):
    return qi*(1.0 - b*di*x)**(-1.0/b)

def abshy(x, b, qi, di):  # same as hy but defined for all x
    return qi*abs(1.0 - b*di*x)**(-1.0/b)

def dhy(x, b, qi, di):  # derivative of hy
    return qi*di*(1.0 - b*di*x)**(-(b + 1.0)/b)

def get_x_from_slope(s, b, qi, di):  # self explaining
    return (1.0 - (s/(qi*di))**(-b/(b + 1.0)))/(b*di)

def exh(x, xlim, qlim, dlim):  # exponential part (actually no free parameters)
    return qlim*np.exp(dlim*(x - xlim))

def trans(x, b, qi, di, s0):  # piecewise function
    x0 = get_x_from_slope(s0, b, qi, di)
    if x < x0:
        out = hy(x, b, qi, di)
    else:
        H0 = hy(x0, b, qi, di)
        out = exh(x, x0, H0, s0/H0)
    return out

def no_if_trans(x, b, qi, di, s0, sharpness=10):  # continuous transition between the two functions
    x0 = get_x_from_slope(s0, b, qi, di)
    H0 = hy(x0, b, qi, di)
    weight = 0.5*(1 + np.tanh(sharpness*(x - x0)))
    return weight*exh(x, x0, H0, s0/H0) + (1.0 - weight)*abshy(x, b, qi, di)

xList = np.linspace(0, 5.5, 90)
hyList = np.fromiter((hy(x, 2.2, 1.2, .1) for x in xList), float)
t1List = np.fromiter((trans(x, 2.2, 1.2, .1, 3.59) for x in xList), float)
nt1List = np.fromiter((no_if_trans(x, 2.2, 1.2, .1, 3.59) for x in xList), float)

fig1 = plt.figure(1)
ax = fig1.add_subplot(1, 1, 1)
ax.plot(xList, hyList)
ax.plot(xList, t1List, linestyle='--')
ax.plot(xList, nt1List, linestyle=':')
ax.set_ylim([1, 10])
ax.set_yscale('log')
plt.show()
There are almost no differences between the two solutions, but your options for using scipy fitting functions are slightly different. The second solution should easily work with curve_fit.
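For instance, a hedged sketch of fitting the smooth variant to hypothetical data arrays xdata and ydata (not part of the answer above):
from scipy.optimize import curve_fit
import numpy as np

# xdata, ydata are placeholders for the measured decline data.
popt, pcov = curve_fit(no_if_trans, xdata, ydata, p0=[2.2, 1.2, 0.1, 3.59])
b_fit, qi_fit, di_fit, s0_fit = popt
perr = np.sqrt(np.diag(pcov))  # one-sigma uncertainties of the fitted parameters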
I'm trying to write my own Python code to compute t-statistics and p-values for one and two tailed independent t tests. I can use the normal approximation, but for the moment I am trying to just use the t-distribution. I've been unsuccessful in matching the results of SciPy's stats library on my test data. I could use a fresh pair of eyes to see if I'm just making a dumb mistake somewhere.
Note, this is cross-posted from Cross-Validated because it's been up for a while over there with no responses, so I thought it can't hurt to also get some software developer opinions. I'm trying to understand if there's an error in the algorithm I'm using, which should reproduce SciPy's result. This is a simple algorithm, so it's puzzling why I can't locate the mistake.
My code:
import numpy as np
import scipy.stats as st
def compute_t_stat(pop1, pop2):
    num1 = pop1.shape[0]
    num2 = pop2.shape[0]

    # The formula for t-stat when population variances differ.
    t_stat = (np.mean(pop1) - np.mean(pop2))/np.sqrt(np.var(pop1)/num1 + np.var(pop2)/num2)

    # ADDED: The Welch-Satterthwaite degrees of freedom.
    df = ((np.var(pop1)/num1 + np.var(pop2)/num2)**(2.0))/((np.var(pop1)/num1)**(2.0)/(num1-1) + (np.var(pop2)/num2)**(2.0)/(num2-1))

    # Am I computing this wrong?
    # It should just come from the CDF like this, right?
    # The extra parameter is the degrees of freedom.
    one_tailed_p_value = 1.0 - st.t.cdf(t_stat, df)
    two_tailed_p_value = 1.0 - (st.t.cdf(np.abs(t_stat), df) - st.t.cdf(-np.abs(t_stat), df))

    # Computing with SciPy's built-ins
    # My results don't match theirs.
    t_ind, p_ind = st.ttest_ind(pop1, pop2)

    return t_stat, one_tailed_p_value, two_tailed_p_value, t_ind, p_ind
Update:
After reading a bit more on the Welch's t-test, I saw that I should be using the Welch-Satterthwaite formula to calculate degrees of freedom. I updated the code above to reflect this.
With the new degrees of freedom, I get a closer result. My two-sided p-value is off by about 0.008 from the SciPy version's... but this is still much too big an error so I must still be doing something incorrect (or SciPy distribution functions are very bad, but it's hard to believe they are only accurate to 2 decimal places).
Second update:
While continuing to try things, I thought maybe SciPy's version automatically computes the Normal approximation to the t-distribution when the degrees of freedom are high enough (roughly > 30). So I re-ran my code using the Normal distribution instead, and the computed results are actually further away from SciPy's than when I use the t-distribution.
Bonus question :)
(More statistical theory related; feel free to ignore)
Also, the t-statistic is negative. I was just wondering what this means for the one-sided t-test. Does this typically mean that I should be looking in the negative axis direction for the test? In my test data, population 1 is a control group who did not receive a certain employment training program. Population 2 did receive it, and the measured data are wage differences before/after treatment.
So I have some reason to think that the mean for population 2 will be larger. But from a statistical theory point of view, it doesn't seem right to concoct a test this way. How could I have known to check (for the one-sided test) in the negative direction without relying on subjective knowledge about the data? Or is this just one of those frequentist things that, while not philosophically rigorous, needs to be done in practice?
By using the SciPy built-in function source(), I could see a printout of the source code for the function ttest_ind(). Based on the source code, the SciPy built-in is performing the t-test assuming that the variances of the two samples are equal. It is not using the Welch-Satterthwaite degrees of freedom. SciPy assumes equal variances but does not state this assumption.
I just want to point out that, crucially, this is why you should not just trust library functions. In my case, I actually do need the t-test for populations of unequal variances, and the degrees of freedom adjustment might matter for some of the smaller data sets I will run this on.
As I mentioned in some comments, the discrepancy between my code and SciPy's is about 0.008 for sample sizes between 30 and 400, and then slowly goes to zero for larger sample sizes. This is an effect of the extra (1/n1 + 1/n2) term in the equal-variances t-statistic denominator. Accuracy-wise, this is pretty important, especially for small sample sizes. It definitely confirms to me that I need to write my own function. (Possibly there are other, better Python libraries, but this at least should be known. Frankly, it's surprising this isn't anywhere up front and center in the SciPy documentation for ttest_ind()).
You are not calculating the sample variance; instead you are using population variances. The sample variance divides by n-1 instead of n. np.var has an optional argument called ddof for exactly this reason.
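A quick illustration of the difference (expected values shown as comments):
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
np.var(a)          # 1.25     (population variance, divides by n)
np.var(a, ddof=1)  # 1.666... (sample variance, divides by n - 1)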
This should give you your expected result:
import numpy as np
import scipy.stats as st
def compute_t_stat(pop1, pop2):
    num1 = pop1.shape[0]
    num2 = pop2.shape[0]
    var1 = np.var(pop1, ddof=1)  # sample variances (divide by n - 1)
    var2 = np.var(pop2, ddof=1)

    # The formula for t-stat when population variances differ.
    t_stat = (np.mean(pop1) - np.mean(pop2)) / np.sqrt(var1/num1 + var2/num2)

    # The Welch-Satterthwaite degrees of freedom.
    df = ((var1/num1 + var2/num2)**(2.0))/((var1/num1)**(2.0)/(num1-1) + (var2/num2)**(2.0)/(num2-1))

    # p-values from the t CDF with df degrees of freedom.
    one_tailed_p_value = 1.0 - st.t.cdf(t_stat, df)
    two_tailed_p_value = 1.0 - (st.t.cdf(np.abs(t_stat), df) - st.t.cdf(-np.abs(t_stat), df))

    # SciPy's built-in (assumes equal variances, so it will generally differ).
    t_ind, p_ind = st.ttest_ind(pop1, pop2)

    return t_stat, one_tailed_p_value, two_tailed_p_value, t_ind, p_ind
PS: SciPy is open source and mostly implemented in Python. You could have checked the source code for ttest_ind and found the mistake yourself.
For the bonus question: you don't decide on the side of the one-tailed test by looking at your t-value. You decide it beforehand with your hypothesis. If your null hypothesis is that the means are equal and your alternative hypothesis is that the second mean is larger, then your tail should be on the left (negative) side, because sufficiently small (negative) values of your t-statistic indicate that the alternative hypothesis is more likely to be true than the null hypothesis.
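In code, sticking with the t_stat and df computed above, that one-sided p-value is the left tail (a sketch, not from the answer itself):
import scipy.stats as st

# H1: mean(pop2) > mean(pop1), so sufficiently negative t_stat supports H1.
one_tailed_p_left = st.t.cdf(t_stat, df)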
It looks like you forgot the **2 on the numerator of your df, the Welch-Satterthwaite degrees of freedom:
df = (np.var(pop1)/num1 + np.var(pop2)/num2)/( (np.var(pop1)/num1)**(2.0)/(num1-1) + (np.var(pop2)/num2)**(2.0)/(num2-1) )
should be:
df = (np.var(pop1)/num1 + np.var(pop2)/num2)**2/( (np.var(pop1)/num1)**(2.0)/(num1-1) + (np.var(pop2)/num2)**(2.0)/(num2-1) )
I have been doing some Monte Carlo physics simulations with Python and I am unable to determine the standard errors for the coefficients of a non-linear least squares fit.
Initially, I was using SciPy's scipy.stats.linregress for my model since I thought it would be a linear model, but I noticed it is actually some sort of power function. I then used NumPy's polyfit with the degree set to 2, but I can't find any way to determine the standard error of the coefficients.
I know gnuplot can determine the errors for me, but I need to do fits for over 30 different cases. I was wondering if anyone knows of any way for Python to read the standard error from gnuplot, or if there is some other library I can use?
Finally found the answer to this long-asked question! I'm hoping this can at least save someone a few hours of hopeless research on this topic. SciPy has a function called curve_fit in its optimize module. It uses the least squares method to determine the coefficients and, best of all, it gives you the covariance matrix. The diagonal of that matrix contains the variance of each coefficient, and by taking the square root of those values the standard error of each coefficient can be determined! SciPy doesn't have much documentation for this, so here's some sample code for a better understanding:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plot
def func(x, a, b, c):
    return a*x**2 + b*x + c  # Refer [1]

x = np.linspace(0, 4, 50)
y = func(x, 2.6, 2, 3) + 4*np.random.normal(size=len(x))  # Refer [2]

coeff, var_matrix = curve_fit(func, x, y)
variance = np.diagonal(var_matrix)  # Refer [3]
SE = np.sqrt(variance)  # Refer [4]

# ======Making a dictionary to print results========
results = {'a': [coeff[0], SE[0]], 'b': [coeff[1], SE[1]], 'c': [coeff[2], SE[2]]}
print("Coeff\tValue\t\tError")
for v, c in results.items():
    print(v, "\t", c[0], "\t", c[1])
# ========End Results Printing=================

y2 = func(x, coeff[0], coeff[1], coeff[2])  # Saves the y values for the fitted model
plot.plot(x, y)
plot.plot(x, y2)
plot.show()
[1] What this function returns is critical because it defines what will be used to fit the model.
[2] Using the function to create some arbitrary data plus some noise.
[3] Saves the covariance matrix's diagonal to a 1-D array.
[4] Taking the square root of the variance gives the standard error (SE).
It looks like gnuplot uses Levenberg-Marquardt, and there's a Python implementation (mpfit) available - you can get the error estimates from the mpfit.covar attribute. (Incidentally, you should think about what the error estimates "mean" - are other parameters allowed to adjust to compensate, for example?)