I am having trouble computing a likelihood ratio test in Python 2.7.
I have two models and the corresponding likelihood values. I believe the rule for comparing whether model L2 is better than model L1 (if the models are closely related) is to look at -2 * log(L2/L1).
I then want to find the p-value corresponding to -2 * log(L2/L1) and relate it to the significance with which L2 is preferred over L1. Here is what I have so far:
import numpy as np
import scipy.special
from scipy.stats import chisqprob

L1 = 467400. # log(likelihood) of my 1st fit
L2 = 467414. # log(likelihood) of my 2nd fit

LR = -2. * np.log(L2 / L1)  # LR = -5.9905e-05
p = chisqprob(LR, 1)        # L2 has 1 DoF more than L1
print 'p: %.30f' % p        # p = 1.000000000000000000000000000000

five_sigma = 1 - scipy.special.erf(5 / np.sqrt(2.))  # :-)
print '5 sigma: %.30f' % five_sigma

five_sigma_check = 1 - 0.999999426696856              # :-(
print 'Check  : %.30f' % five_sigma_check
However, I run into two issues:
My p-value is coming out to be 1 when I'd have expected it to be close to 0.
When I use the formula on the line marked with the :-) to find five sigma, for example, it differs from the value quoted in the literature; the literature value is on the line marked with the :-(. My value for five_sigma_check is taken from here.
Can anyone offer any advice, please? I'm relatively new to the world of Python and statistics.
Thanks.
To calculate the likelihood-ratio test statistic given the log-likelihoods, use this formula:
from scipy.stats.distributions import chi2

def likelihood_ratio(llmin, llmax):
    return 2 * (llmax - llmin)

LR = likelihood_ratio(L1, L2)

p = chi2.sf(LR, 1)  # L2 has 1 DoF more than L1
print 'p: %.30f' % p
# p: 0.000000121315450836607258011741
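Since the question also asks how to relate this p-value to a number of sigma, here is a small sketch (not part of the original answer, and assuming the usual two-sided convention) that converts the p-value into an equivalent Gaussian significance with scipy.stats.norm:

from scipy.stats import norm, chi2

LR = 2 * (467414. - 467400.)   # test statistic from the two log-likelihoods above
p = chi2.sf(LR, 1)

# equivalent number of sigma under a two-sided convention
n_sigma = norm.isf(p / 2.)
print 'p = %g, significance = %.2f sigma' % (p, n_sigma)

# sanity check: a two-sided 5-sigma threshold corresponds to
print '5 sigma p-value: %.3e' % (2 * norm.sf(5))   # ~5.7e-07, i.e. 1 - 0.999999426696856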
Adding a little bit of theory, so that it is easier to understand:
Here we have two models (assumed nested) and we want to compare how well they explain the data. Under the null hypothesis, the statistic Λ = -2(ll_0 - ll_1) asymptotically follows a χ2 distribution with degrees of freedom equal to the difference in the number of free parameters (Wilks' theorem). Let's now implement the LR test with Python 3:
from scipy.stats import chi2
ll_0, ll_1 = 467400, 467414 # given, the log-likelihoods of the nested models m_0, m_1
# log likelihood for m_0 (H_0) must be <= log likelihood of m_1 (H_1)
Λ = -2 * (ll_0 - ll_1)
print(Λ)
# 28.0
df = 1 # given the difference in dof
# compute the p-value
pvalue = 1 - chi2(df).cdf(Λ) # since Λ follows χ2
print(pvalue)
# 1.2131545083660726e-07
We can plot and clearly see that we can reject the null hypothesis in favor of model m_1 at α = 0.05:
import numpy as np
import matplotlib.pyplot as plt

α, df = 0.05, 1
x = np.linspace(0, 30, 1000)
plt.plot(x, chi2(df).pdf(x), label='χ2')
plt.axvline(chi2(df).ppf(1-α), color='red', label='α=0.05')
plt.scatter(Λ, 0, color='green', s=50, label='Λ')
plt.legend()
plt.title('χ2({}) distribution'.format(df), size=20)
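As a small addition (not part of the original answer), the same decision can be made by comparing Λ with the χ2 critical value instead of looking at the p-value:

# reject H_0 when Λ exceeds the critical value at level α
crit = chi2(df).ppf(1 - α)
print(Λ, '>', crit, '->', Λ > crit)   # 28.0 > 3.84..., so m_0 is rejected in favour of m_1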
I am trying to fit some data using scipy.optimize.curve_fit. I have read the documentation and also this StackOverflow post, but neither seems to answer my question.
I have some simple 2D data that looks approximately like a trig function, and I want to fit it with a general trig function using scipy.
My approach is as follows:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Load the data
data = np.loadtxt('example_data.txt')
t = data[:,0]
y = data[:,1]

# Define the function to fit
def func_cos(t, A, omega, dphi, C):
    # A is the amplitude, omega the frequency, dphi and C the horizontal/vertical shifts
    return A*np.cos(omega*t + dphi) + C

# Do a scipy fit
popt, pcov = curve_fit(func_cos, t, y)

# Plot fit data and original data
fig = plt.figure(figsize=(14,10))
ax1 = plt.subplot2grid((1,1), (0,0))
ax1.plot(t, y)
ax1.plot(t, func_cos(t, *popt))
This outputs:
where blue is the data and orange is the fit. Clearly I am doing something wrong. Any pointers?
If no values are provided for the initial guess of the parameters p0, then a value of 1 is assumed for each of them. From the docs:
p0 : array_like, optional
Initial guess for the parameters (length N). If None, then the initial values will all be 1 (if the number of parameters for the function can be determined using introspection, otherwise a ValueError is raised).
Since your data has very large x-values and very small y-values an initial guess of 1 is far from the actual solution and hence the optimizer does not converge. You can help the optimizer by providing suitable initial parameter values that can be guessed / approximated from the data:
Amplitude: A = (y.max() - y.min()) / 2
Offset: C = (y.max() + y.min()) / 2
Frequency: here we can estimate the number of zero crossings by multiplying consecutive y-values and checking which products are smaller than zero. This number divided by the total x-range gives the frequency, and multiplying by pi converts it to the required angular units: y_shifted = y - offset; omega = np.pi * np.sum(y_shifted[:-1] * y_shifted[1:] < 0) / (t.max() - t.min())
Phase shift: can be set to zero, dphi = 0
So in summary, the following initial parameter guess can be used:
offset = (y.max() + y.min()) / 2
y_shifted = y - offset
p0 = (
    (y.max() - y.min()) / 2,
    np.pi * np.sum(y_shifted[:-1] * y_shifted[1:] < 0) / (t.max() - t.min()),
    0,
    offset
)
popt, pcov = curve_fit(func_cos, t, y, p0=p0)
Which gives me the following fit function:
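The plot itself is not reproduced here; a quick way to regenerate it (assuming t, y, popt, and func_cos from the code above, plus matplotlib) is:

import matplotlib.pyplot as plt

plt.plot(t, y, label='data')                   # original data
plt.plot(t, func_cos(t, *popt), label='fit')   # fit obtained with the p0 initial guess
plt.legend()
plt.show()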
I would like to sample from an arbitrary function in Python.
In Fast arbitrary distribution random sampling it was stated that one could use inverse transform sampling, and in Pythonic way to select list elements with different probability it was mentioned that one should use the inverse cumulative distribution function. As far as I understand, those methods only work in the univariate case. My function is multivariate, though, and too complex for any of the suggestions in https://stackoverflow.com/a/48676209/4533188 to apply.
Preliminaries: my function is based on Rosenbrock's banana function, whose value we can get with
import scipy.optimize
scipy.optimize.rosen([1.1,1.2])
(here [1.1,1.2] is the input vector) from scipy, see https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.optimize.rosen.html.
Here is what I came up with: I make a grid over my area of interest and calculate the function value for each point. Then I sort the resulting data frame by the value and compute a cumulative sum. This way we get "slots" of different sizes - points with large function values get larger slots than points with small function values. Now we generate random values and look up which slot each random value falls into; the corresponding row of the data frame is our final sample.
Here is the code:
import numpy as np
import pandas as pd
import scipy.optimize
from itertools import product
from dfply import *

nb_of_samples = 50
nb_of_grid_points = 30

rosen_data = pd.DataFrame(
    np.array([item for item in product(*[np.linspace(fm[0], fm[1], nb_of_grid_points)
                                         for fm in zip([-2,-2], [2,2])])]),
    columns=['x','y'])
rosen_data['z'] = [np.exp(-scipy.optimize.rosen(row)**2/500) for index, row in rosen_data.iterrows()]
rosen_data = rosen_data >> \
    arrange(X.z) >> \
    mutate(z_upperbound=cumsum(X.z)) >> \
    mutate(z_upperbound=X.z_upperbound/np.max(X.z_upperbound))

value = np.random.sample(1)[0]

def get_rosen_sample(value):
    return (rosen_data >> mask(X.z_upperbound >= value) >> select(X.x, X.y)).iloc[0,]

values = pd.DataFrame([get_rosen_sample(s) for s in np.random.sample(nb_of_samples)])
This works well, but I don't think it is very efficient. What would be a more efficient solution to my problem?
I read that Markov chain Monte Carlo might help, but for now I am in over my head on how to do this in Python.
I was in a similar situation, so, I implemented a rudimentary version of Metropolis-Hastings (which is an MCMC method) to sample from a bivariate distribution. An example follows.
Say we want to sample from the following density:
import numpy as np

def density1(z):
    z = np.reshape(z, [z.shape[0], 2])
    z1, z2 = z[:, 0], z[:, 1]
    norm = np.sqrt(z1 ** 2 + z2 ** 2)
    exp1 = np.exp(-0.5 * ((z1 - 2) / 0.8) ** 2)
    exp2 = np.exp(-0.5 * ((z1 + 2) / 0.8) ** 2)
    u = 0.5 * ((norm - 4) / 0.4) ** 2 - np.log(exp1 + exp2)
    return np.exp(-u)
which looks like this
The following function implements MH with a multivariate normal as the proposal:
def metropolis_hastings(target_density, size=500000):
    burnin_size = 10000
    size += burnin_size
    x0 = np.array([[0, 0]])
    xt = x0
    samples = []
    for i in range(size):
        # propose a candidate from a multivariate normal centred on the current state
        xt_candidate = np.array([np.random.multivariate_normal(xt[0], np.eye(2))])
        accept_prob = (target_density(xt_candidate)) / (target_density(xt))
        if np.random.uniform(0, 1) < accept_prob:
            xt = xt_candidate
        samples.append(xt)
    samples = np.array(samples[burnin_size:])
    samples = np.reshape(samples, [samples.shape[0], 2])
    return samples
Run MH and plot samples
import matplotlib.pyplot as plt

samples = metropolis_hastings(density1)
plt.hexbin(samples[:,0], samples[:,1], cmap='rainbow')
plt.gca().set_aspect('equal', adjustable='box')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.show()
Check out this repo of mine for details.
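To connect this back to the original question, here is a hypothetical way to reuse the same sampler for the Rosenbrock-based density defined there (the rosen_density wrapper is my own addition, not from the answer):

import scipy.optimize

def rosen_density(z):
    # same (n, 2)-shaped interface as density1 above
    z = np.reshape(z, [z.shape[0], 2])
    return np.array([np.exp(-scipy.optimize.rosen(zi)**2 / 500) for zi in z])

rosen_samples = metropolis_hastings(rosen_density, size=50000)
plt.hexbin(rosen_samples[:, 0], rosen_samples[:, 1], cmap='rainbow')
plt.show()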
I have a nonlinear model fit that looks like this:
The dark solid line is the model fit, and the grey part is the raw data.
Short version of the question: how do I get the likelihood of this model fit, so I can perform a log-likelihood ratio test? Assume that the residuals are normally distributed.
I am relatively new to statistics, and my current thoughts are:
Get the residuals from the curve fit and calculate their variance;
Use this equation (the Gaussian log-likelihood): $\ln L = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (x_i - \mu_i)^2$
And plug the variance of the residuals into sigma-squared, with x_i as the experiment and mu_i as the model fit;
Calculate the log-likelihood ratio.
Could anyone help me with these two full-version questions?
Is my method correct? (I think so, but it would be really great to make sure!)
Are there any ready-made functions in python/scipy/statsmodels to do this for me?
Your likelihood function, which is simply the sum of the logs of the Gaussian probability density function, is the likelihood of fitting a mu and a sigma for your residuals, not the likelihood of your model given your data. In a word, your approach is wrong.
Since you are doing non-linear least squares, following what @usethedeathstar already mentioned, you should go straight for the F-test. Consider the following example, modified from http://www.walkingrandomly.com/?p=5254, where we conduct the F-test using R; at the end we will discuss how to translate it into Python.
# construct the data vectors using c()
> xdata = c(-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9)
> ydata = c(0.699369,0.700462,0.695354,1.03905,1.97389,2.41143,1.91091,0.919576,-0.730975,-1.42001)
# some starting values
> p1 = 1
> p2 = 0.2
> p3 = 0.01
# do the fit
> fit1 = nls(ydata ~ p1*cos(p2*xdata) + p2*sin(p1*xdata), start=list(p1=p1,p2=p2))
> fit2 = nls(ydata ~ p1*cos(p2*xdata) + p2*sin(p1*xdata)+p3*xdata, start=list(p1=p1,p2=p2,p3=p3))
# summarise
> summary(fit1)
Formula: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata)
Parameters:
Estimate Std. Error t value Pr(>|t|)
p1 1.881851 0.027430 68.61 2.27e-12 ***
p2 0.700230 0.009153 76.51 9.50e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08202 on 8 degrees of freedom
Number of iterations to convergence: 7
Achieved convergence tolerance: 2.189e-06
> summary(fit2)
Formula: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata) + p3 * xdata
Parameters:
Estimate Std. Error t value Pr(>|t|)
p1 1.90108 0.03520 54.002 1.96e-10 ***
p2 0.70657 0.01167 60.528 8.82e-11 ***
p3 0.02029 0.02166 0.937 0.38
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08243 on 7 degrees of freedom
Number of iterations to convergence: 9
Achieved convergence tolerance: 2.476e-06
> anova(fit2, fit1)
Analysis of Variance Table
Model 1: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata) + p3 * xdata
Model 2: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata)
Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
1 7 0.047565
2 8 0.053813 -1 -0.0062473 0.9194 0.3696
Here we have two models: fit1 has 2 parameters, so its residuals have 8 degrees of freedom; fit2 has one additional parameter and its residuals have 7 degrees of freedom. Is model 2 significantly better? No: the F value is 0.9194 on (1,7) degrees of freedom, which is not significant.
To get the ANOVA table: the residual DF is easy. Residual sum of squares: 0.08202*0.08202*8=0.05381 and 0.08243*0.08243*7=0.04756293 (notice: 'Residual standard error: 0.08202 on 8 degrees of freedom', etc.). In Python, you can get it by summing (y_observed - y_fitted)**2, since scipy.optimize.curve_fit() doesn't return the residuals.
The F-ratio is 0.0062473/0.047565*7, and to get the P-value: 1 - scipy.stats.f.cdf(0.9194, 1, 7).
Putting them together, we have the Python equivalent:
In [1]:
import numpy as np
import scipy.optimize as so
import scipy.stats as ss

xdata = np.array([-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9])
ydata = np.array([0.699369,0.700462,0.695354,1.03905,1.97389,2.41143,1.91091,0.919576,-0.730975,-1.42001])

def model0(x, p1, p2):
    return p1*np.cos(p2*x) + p2*np.sin(p1*x)

def model1(x, p1, p2, p3):
    return p1*np.cos(p2*x) + p2*np.sin(p1*x) + p3*x

p1, p2, p3 = 1, 0.2, 0.01
fit0 = so.curve_fit(model0, xdata, ydata, p0=(p1,p2))[0]
fit1 = so.curve_fit(model1, xdata, ydata, p0=(p1,p2,p3))[0]
yfit0 = model0(xdata, fit0[0], fit0[1])
yfit1 = model1(xdata, fit1[0], fit1[1], fit1[2])
ssq0 = ((yfit0-ydata)**2).sum()
ssq1 = ((yfit1-ydata)**2).sum()
df = len(xdata) - 3
f_ratio = (ssq0-ssq1)/(ssq1/df)
p = 1 - ss.f.cdf(f_ratio, 1, df)

In [2]:
print f_ratio, p
0.919387419515 0.369574503394
As @usethedeathstar pointed out: when the residuals are normally distributed, nonlinear least squares IS maximum likelihood. Therefore the F-test and the likelihood ratio test are equivalent, because the F-ratio is a monotone transformation of the likelihood ratio λ.
Or in a descriptive way, see: http://www.stata.com/support/faqs/statistics/chi-squared-and-f-distributions/
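To illustrate that equivalence with the numbers above (this sketch is my own addition, reusing ssq0, ssq1, xdata, np and ss from the code block): with Gaussian residuals, the maximized log-likelihood of a least-squares fit depends on the data only through the residual sum of squares, so the LR statistic reduces to n * log(ssq0/ssq1).

n = len(xdata)
ll0 = -n/2. * (np.log(2*np.pi*ssq0/n) + 1)   # maximized log-likelihood of model0
ll1 = -n/2. * (np.log(2*np.pi*ssq1/n) + 1)   # maximized log-likelihood of model1
lr_stat = 2 * (ll1 - ll0)                    # equals n * np.log(ssq0/ssq1)
print lr_stat, ss.chi2.sf(lr_stat, 1)        # asymptotic chi2 p-value, for comparison with the F-test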
Your formula looks correct to me. It should give you the same results as scipy.stats.norm.logpdf(x, loc=mu, scale=sigma)
Since you already have your estimates of mu and sigma, I don't think there is a function for the likelihood ratio test where you can plug your results in.
If you have the estimates of two models, where one is nested in the other, then you can easily calculate it yourself.
http://en.wikipedia.org/wiki/Likelihood-ratio_test
Here is the part of a method in statsmodels that calculates the LR-test for comparing two nested linear models
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/linear_model.py#L1531
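For the general case, a minimal hand-rolled version looks like the sketch below (my own helper, not a statsmodels function); llf_restricted and llf_full are assumed to be the maximized log-likelihoods of the nested and the full model, and df_diff the difference in the number of parameters:

from scipy.stats import chi2

def lr_test(llf_restricted, llf_full, df_diff):
    # likelihood-ratio test for two nested models
    lr_stat = 2 * (llf_full - llf_restricted)
    p_value = chi2.sf(lr_stat, df_diff)
    return lr_stat, p_value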
I've been trying to fit the amplitude, frequency and phase of a sine curve given some generated two dimensional toy data. (Code at the end)
To get estimates for the three parameters, I first perform an FFT. I use the values from the FFT as initial guesses for the actual frequency and phase and then fit for them (row by row). I wrote my code such that I input which bin of the FFT I want the frequency to be in, so I can check whether the fitting is working well. But there's some pretty strange behaviour. If my input bin is, say, 3.1 (a non-integral bin, so the FFT won't give me the right frequency) then the fit works wonderfully. But if the input bin is 3 (so the FFT outputs the exact frequency) then my fit fails, and I'm trying to understand why.
Here's the output when I give the input bins (in the X and Y direction) as 3.0 and 2.1 respectively:
(The plot on the right is data - fit)
Here's the output when I give the input bins as 3.0 and 2.0:
Question: Why does the non linear fit fail when I input the exact frequency of the curve?
Code:
#! /usr/bin/python

# For the purposes of this code, it's easier to think of the X-Y axes as transposed,
# so the X axis is vertical and the Y axis is horizontal

import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as optimize
import itertools
import sys

PI = np.pi

# Function which accepts parameters to define a sin curve
# Used for the non linear fit
def sineFit(t, a, f, p):
    return a * np.sin(2.0 * PI * f*t + p)

xSize = 18
ySize = 60
npt = xSize * ySize

# Get frequency bin from user input
xFreq = float(sys.argv[1])
yFreq = float(sys.argv[2])

xPeriod = xSize/xFreq
yPeriod = ySize/yFreq

# arrays should be defined here

# Generate the 2D sine curve
for jj in range(0, xSize):
    for ii in range(0, ySize):
        sineGen[jj, ii] = np.cos(2.0*PI*(ii/xPeriod + jj/yPeriod))

# Compute 2dim FFT as well as freq bins along each axis
fftData = np.fft.fft2(sineGen)
fftMean = np.mean(fftData)
fftRMS = np.std(fftData)
xFreqArr = np.fft.fftfreq(fftData.shape[1])  # Frequency bins along x
yFreqArr = np.fft.fftfreq(fftData.shape[0])  # Frequency bins along y

# Find peak of FFT, and position of peak
maxVal = np.amax(np.abs(fftData))
maxPos = np.where(np.abs(fftData) == maxVal)

# Iterate through peaks in the FFT
# For this example, number of loops will always be only one
prevPhase = -1000
for col, row in itertools.izip(maxPos[0], maxPos[1]):
    # Initial guesses for fit parameters from FFT
    init_phase = np.angle(fftData[col,row])
    init_amp = 2.0 * maxVal/npt
    init_freqY = yFreqArr[col]
    init_freqX = xFreqArr[row]
    cntr = 0
    if prevPhase == -1000:
        prevPhase = init_phase
    guess = [init_amp, init_freqX, prevPhase]
    # Fit each row of the 2D sine curve independently
    for rr in sineGen:
        (amp, freq, phs), pcov = optimize.curve_fit(sineFit, xDat, rr, guess)
        # xDat is a linspace array, containing a list of numbers from 0 to xSize-1
        # Subtract fit from original data and plot
        fitData = sineFit(xDat, amp, freq, phs)
        sub1 = rr - fitData
        # Plot
        fig1 = plt.figure()
        ax1 = fig1.add_subplot(121)
        p1, = ax1.plot(rr, 'g')
        p2, = ax1.plot(fitData, 'b')
        plt.legend([p1,p2], ["data", "fit"])
        ax2 = fig1.add_subplot(122)
        p3, = ax2.plot(sub1)
        plt.legend([p3], ['residual1'])
        fig1.tight_layout()
        plt.show()
        cntr += 1
        prevPhase = phs  # Update guess for phase of sine curve
I've tried to distill the important parts of your question into this answer.
First of all, try fitting a single block of data, not an array. Once you are confident that your model is sufficient you can move on.
Your fit is only going to be as good as your model, if you move on to something not "sine"-like you'll need to adjust accordingly.
Fitting is an "art", in that the initial conditions can greatly change the convergence of the error function. In addition, there may be more than one minimum in your fits, so you often have to worry about the uniqueness of the proposed solution.
While you were on the right track with your FFT idea, I think your implementation wasn't quite correct. The code below should be a great toy system. It generates random data of the type f(x) = a0*sin(a1*x+a2). Sometimes a random initial guess will work, sometimes it will fail spectacularly. However, using the FFT guess for the frequency the convergence should always work for this system. An example output:
import numpy as np
import pylab as plt
import scipy.optimize as optimize

# This is your target function
def sineFit(t, (a, f, p)):
    return a * np.sin(2.0*np.pi*f*t + p)

# This is our "error" function
def err_func(p0, X, Y, target_function):
    err = ((Y - target_function(X, p0))**2).sum()
    return err

# Try out different parameters, sometimes the random guess works
# sometimes it fails. The FFT solution should always work for this problem
inital_args = np.random.random(3)

X = np.linspace(0, 10, 1000)
Y = sineFit(X, inital_args)

# Use a random inital guess
inital_guess = np.random.random(3)

# Fit
sol = optimize.fmin(err_func, inital_guess, args=(X,Y,sineFit))

# Plot the fit
Y2 = sineFit(X, sol)
plt.figure(figsize=(15,10))
plt.subplot(211)
plt.title("Random Inital Guess: Final Parameters: %s"%sol)
plt.plot(X,Y)
plt.plot(X,Y2,'r',alpha=.5,lw=10)

# Use an improved "fft" guess for the frequency
# this will be the max in k-space
timestep = X[1]-X[0]
guess_k = np.argmax( np.fft.rfft(Y) )
guess_f = np.fft.fftfreq(X.size, timestep)[guess_k]
inital_guess[1] = guess_f

# Guess the amplitude by taking the max of the absolute values
inital_guess[0] = np.abs(Y).max()

sol = optimize.fmin(err_func, inital_guess, args=(X,Y,sineFit))
Y2 = sineFit(X, sol)

plt.subplot(212)
plt.title("FFT Guess : Final Parameters: %s"%sol)
plt.plot(X,Y)
plt.plot(X,Y2,'r',alpha=.5,lw=10)
plt.show()
The problem is due to a bad initial guess for the phase, not the frequency. While cycling through the rows of sineGen (inner loop) you use the fit result of the previous row as the initial guess for the next row, which does not always work. If you determine the phase from an FFT of the current row and use that as the initial guess, the fit will succeed.
You could change the inner loop as follows:
for n, rr in enumerate(sineGen):
    fftx = np.fft.fft(rr)
    fftx = fftx[:len(fftx)/2]
    idx = np.argmax(np.abs(fftx))
    init_phase = np.angle(fftx[idx])
    print fftx[idx], init_phase
...
Also you need to change
def sineFit(t, a, f, p):
    return a * np.sin(2.0 * np.pi * f*t + p)
to
def sineFit(t, a, f, p):
    return a * np.cos(2.0 * np.pi * f*t + p)
since phase = 0 means that the imaginary part of the FFT is zero and thus the function is cosine-like.
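A quick numerical check of that statement (my own example, not from the answer): the FFT bin of a pure cosine has phase ~0, while a pure sine has phase ~ -pi/2.

import numpy as np

t = np.arange(64)
print np.angle(np.fft.fft(np.cos(2*np.pi*3*t/64.))[3])   # ~0 for a cosine
print np.angle(np.fft.fft(np.sin(2*np.pi*3*t/64.))[3])   # ~ -pi/2 for a sine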
Btw. your sample above is still lacking definitions of sineGen and xDat.
Without understanding much of your code, according to http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html:
(amp2, freq2, phs2), pcov = optimize.curve_fit(sineFit, tDat, sub1, guess2)
should become:
(amp2, freq2, phs2), pcov = optimize.curve_fit(sineFit, tDat, sub1, p0=guess2)
Assuming that tDat and sub1 are x and y, that should do the trick. But, once again, it is quite difficult to understand such complex code with so many interlinked variables and no comments at all. Code should always be built from the bottom up, meaning that you don't do a loop of fits when a single one is not working, and you don't add noise until the code can fit the noise-free examples... Good luck!
By "nothing fancy" I meant something like removing EVERYTHING that is not related with the fit, and doing a simplified mock example such as:
import numpy as np
import scipy.optimize as optimize

def sineFit(t, a, f, p):
    return a * np.sin(2.0 * np.pi * f*t + p)

# Create array of x and y with given parameters
x = np.asarray(range(100))
y = sineFit(x, 1, 0.05, 0)

# Give a guess and fit, printing result of the fitted values
guess = [1., 0.05, 0.]
print optimize.curve_fit(sineFit, x, y, guess)[0]
The result of this is exactly the answer:
[1. 0.05 0.]
But if you change guess not too much, just enough:
# Give a guess and fit, printing result of the fitted values
guess = [1., 0.06, 0.]
print optimize.curve_fit(sineFit, x, y, guess)[0]
the result gives absurdly wrong numbers:
[ 0.00823701 0.06391323 -1.20382787]
Can you explain this behavior?
You can use curve_fit with a series of trigonometric functions, usually very robust and adjustable to the precision that you need just by increasing the number of terms... here is an example:
from numpy import sin, cos, linspace, pi

def f(x, a0, s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,
          c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12):
    return a0 + s1*sin(1*x)  + c1*cos(1*x) \
              + s2*sin(2*x)  + c2*cos(2*x) \
              + s3*sin(3*x)  + c3*cos(3*x) \
              + s4*sin(4*x)  + c4*cos(4*x) \
              + s5*sin(5*x)  + c5*cos(5*x) \
              + s6*sin(6*x)  + c6*cos(6*x) \
              + s7*sin(7*x)  + c7*cos(7*x) \
              + s8*sin(8*x)  + c8*cos(8*x) \
              + s9*sin(9*x)  + c9*cos(9*x) \
              + s10*sin(10*x) + c10*cos(10*x) \
              + s11*sin(11*x) + c11*cos(11*x) \
              + s12*sin(12*x) + c12*cos(12*x)
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

# normalize x so that its range maps into a fraction of the trig period
norm_factor = pi/2. / (x.max() - x.min())
x_norm = x * norm_factor

popt, pcov = curve_fit(f, x_norm, y)

x_fit = linspace(x_norm.min(), x_norm.max(), 1000)
y_fit = f(x_fit, *popt)
plt.plot(x_fit/norm_factor, y_fit)
Suppose we're given a prior on X (e.g. X ~ Gaussian) and a forward operator y = f(x). Suppose further we have observed y by means of an experiment and that this experiment can be repeated indefinitely. The output Y is assumed to be Gaussian (Y ~ Gaussian) or noise-free (Y ~ Delta(observation)).
How to consistently update our subjective degree of knowledge about X given the observations? I've tried the following model with PyMC, but it seems I'm missing something:
from pymc import *

xtrue = 2  # this value is unknown in the real application
x = rnormal(0, 0.01, size=10000)  # initial guess

for i in range(5):
    X = Normal('X', x.mean(), 1./x.var())
    Y = X*X  # f(x) = x*x
    OBS = Normal('OBS', Y, 0.1, value=xtrue*xtrue+rnormal(0,1), observed=True)

    model = Model([X,Y,OBS])
    mcmc = MCMC(model)
    mcmc.sample(10000)

    x = mcmc.trace('X')[:]  # posterior samples
The posterior is not converging to xtrue.
The functionality proposed by @user1572508 is now part of PyMC under the name stochastic_from_data() or Histogram(). The solution to this thread then becomes:
from pymc import *
import matplotlib.pyplot as plt

xtrue = 2  # unknown in the real application
prior = rnormal(0, 1, 10000)  # initial guess is inaccurate

for i in range(5):
    x = stochastic_from_data('x', prior)
    y = x*x
    obs = Normal('obs', y, 0.1, xtrue*xtrue + rnormal(0,1), observed=True)

    model = Model([x, y, obs])
    mcmc = MCMC(model)
    mcmc.sample(10000)

    Matplot.plot(mcmc.trace('x'))
    plt.show()

    prior = mcmc.trace('x')[:]
The problem is that your function, $y = x^2$, is not one-to-one. Specifically, because you lose all information about the sign of X when you square it, there is no way to tell from your Y values whether you originally used 2 or -2 to generate the data. If you create a histogram of your trace for X after just the first iteration, you will see this:
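The histogram is not reproduced here; a hypothetical way to regenerate it from the question's code after the first iteration (assuming the mcmc object and the trace name 'X' used there):

import matplotlib.pyplot as plt

plt.hist(mcmc.trace('X')[:], bins=50)
plt.xlabel('X')
plt.show()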
This distribution has 2 modes, one at 2 (your true value) and one at -2. At the next iteration, x.mean() will be close to zero (averaging over the bimodal distribution), which is obviously not what you want.