Suppose we're given a prior on X (e.g. X ~ Gaussian) and a forward operator y = f(x). Suppose further we have observed y by means of an experiment and that this experiment can be repeated indefinitely. The output Y is assumed to be Gaussian (Y ~ Gaussian) or noise-free (Y ~ Delta(observation)).
How can we consistently update our subjective degree of knowledge about X given the observations? I've tried the following model with PyMC, but it seems I'm missing something:
from pymc import *

xtrue = 2  # this value is unknown in the real application
x = rnormal(0, 0.01, size=10000)  # initial guess

for i in range(5):
    X = Normal('X', x.mean(), 1./x.var())
    Y = X*X  # f(x) = x*x
    OBS = Normal('OBS', Y, 0.1, value=xtrue*xtrue+rnormal(0,1), observed=True)
    model = Model([X,Y,OBS])
    mcmc = MCMC(model)
    mcmc.sample(10000)  # the trace is only filled after sampling; the sample count is an assumption
    x = mcmc.trace('X')[:]  # posterior samples
The posterior is not converging to xtrue.
The functionality proposed by @user1572508 is now part of PyMC under the name stochastic_from_data() or Histogram(). The solution to this thread then becomes:
from pymc import *
import matplotlib.pyplot as plt

xtrue = 2  # unknown in the real application
prior = rnormal(0,1,10000)  # initial guess is inaccurate

for i in range(5):
    x = stochastic_from_data('x', prior)
    y = x*x
    obs = Normal('obs', y, 0.1, value=xtrue*xtrue + rnormal(0,1), observed=True)
    model = Model([x,y,obs])
    mcmc = MCMC(model)
    mcmc.sample(10000)  # the trace is only filled after sampling; the sample count is an assumption
    prior = mcmc.trace('x')[:]
The problem is that your function, $y = x^2$, is not one-to-one. Specifically, because you lose all information about the sign of X when you square it, there is no way to tell from your Y values whether you originally used 2 or -2 to generate the data. If you create a histogram of your trace for X after just the first iteration, you will see this:
This distribution has 2 modes, one at 2 (your true value) and one at -2. At the next iteration, x.mean() will be close to zero (averaging over the bimodal distribution), which is obviously not what you want.
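The bimodality described above can be reproduced without MCMC by evaluating the posterior on a grid. The following is only a minimal sketch: it assumes a standard normal prior, a single noise-free observation y = 4 (that is, xtrue = 2), and an observation noise of sigma = 0.5. These values and all variable names are illustrative and not taken from the original model.

```python
import numpy as np

# Symmetric grid of candidate x values
x = np.linspace(-5.0, 5.0, 2001)

# Unnormalized N(0, 1) prior on x (assumption for this sketch)
prior = np.exp(-0.5 * x**2)

# Gaussian likelihood of observing y_obs given f(x) = x**2
y_obs = 4.0   # noise-free observation of xtrue**2 with xtrue = 2
sigma = 0.5   # assumed observation noise (standard deviation)
likelihood = np.exp(-0.5 * ((y_obs - x**2) / sigma) ** 2)

# Posterior weights on the grid, normalized to sum to one
posterior = prior * likelihood
posterior /= posterior.sum()

# The posterior mean sits between the two modes, near zero
post_mean = (x * posterior).sum()

# Locate the mode on each side of zero
pos_mode = x[x > 0][np.argmax(posterior[x > 0])]
neg_mode = x[x < 0][np.argmax(posterior[x < 0])]

print(post_mean, neg_mode, pos_mode)
```

Because the likelihood depends on x only through x**2, the two modes are mirror images of each other, so feeding the posterior mean back in as the next prior mean discards the information gained.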
I was trying to adapt the solution proposed in this thread to determine the parameters of a simple normal distribution. Even though the modifications are minor (based on Wikipedia), the result is quite far off. Any suggestions as to where it goes wrong?
import math
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt

def gaussian(x, mu, sig):
    return 1./(math.sqrt(2.*math.pi)*sig)*np.exp(-np.power((x - mu)/sig, 2.)/2)

def lik(parameters):
    mu = parameters[0]
    sigma = parameters[1]
    n = len(x)
    L = n/2.0 * np.log(2 * np.pi) + n/2.0 * math.log(sigma **2 ) + 1/(2*sigma**2) * sum([(x_ - mu)**2 for x_ in x ])
    return L

mu0 = 10
sigma0 = 2
x = np.arange(1,20, 0.1)
y = gaussian(x, mu0, sigma0)

lik_model = minimize(lik, np.array([5,5]), method='L-BFGS-B')
mu = lik_model['x'][0]
sigma = lik_model['x'][1]
print(lik_model)

plt.plot(x, gaussian(x, mu, sigma), label = 'fit')
plt.plot(x, y, label = 'data')
Output of the fit:
jac: array([2.27373675e-05, 2.27373675e-05])
success: True
x: array([10.45000245, 5.48475283])
The maximum likelihood method is for fitting the parameters of a distribution to a set of values that are purportedly a random sample from that distribution. In your lik function, you use x to hold the sample, but x is a global variable that you have set to x = np.arange(1,20, 0.1). That is definitely not a random sample from a normal distribution.
Because you are using the normal distribution, you can use the known formulas for the maximum likelihood estimate to check your computation: mu is the sample mean, and sigma is the sample standard deviation:
In [17]: x.mean()
Out[17]: 10.450000000000006
In [18]: x.std()
Out[18]: 5.484751589634671
Those values match the result of your call to minimize pretty closely, so it looks like your code is working.
To modify your code to use MLE in the way you expected it to work, x should be a collection of values that are purportedly a random sample from a normal distribution. Note that your array y is not such a sample. It is the value of the probability density function (PDF) on a grid. If fitting the distribution to a sample of the PDF is your actual goal, you can use a curve-fitting function such as scipy.optimize.curve_fit.
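For the curve-fitting route mentioned above, a minimal sketch could look like the following. It assumes the goal really is to fit the PDF values on the grid; the starting guess p0 is arbitrary, and the variable names are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, mu, sig):
    # Normal probability density function
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (np.sqrt(2.0 * np.pi) * sig)

# PDF values on a grid, as in the question (mu0 = 10, sigma0 = 2)
x_grid = np.arange(1, 20, 0.1)
y_grid = gaussian(x_grid, 10.0, 2.0)

# Least-squares fit of the curve itself; p0 is an arbitrary starting guess
popt, pcov = curve_fit(gaussian, x_grid, y_grid, p0=[8.0, 3.0])
mu_fit, sig_fit = popt
print(mu_fit, sig_fit)
```

Here the pairs (x_grid, y_grid) are treated as points on a curve, which is a different problem from estimating parameters from a random sample.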
If fitting the normal distribution parameters to a random sample is, in fact, what you want to do, then to test your code, you should use an input that is a reasonably large sample from a distribution with known parameters. In this case, you can do
x = np.random.normal(loc=mu0, scale=sigma0, size=20)
When I use such an x in your code, I get
In [20]: lik_model.x
Out[20]: array([ 9.5760996 , 2.01946582])
As expected, the values in the solution are approximately 10 and 2.
(If you use x for your sample as I did, you'll have to change your plotting code accordingly.)
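The check described above, comparing the numerical optimum with the closed-form estimates, can be sketched as follows. The seed, sample size, and bounds are assumptions made for this illustration; the negative log-likelihood is the same expression as in the question, written with the sample passed in explicitly rather than read from a global.

```python
import numpy as np
from scipy.optimize import minimize

# Random sample from a normal distribution with known parameters
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=2000)

def neg_log_lik(params, data):
    # Negative log-likelihood of a normal distribution
    mu, sigma = params
    n = len(data)
    return (n / 2.0 * np.log(2.0 * np.pi)
            + n * np.log(sigma)
            + np.sum((data - mu) ** 2) / (2.0 * sigma ** 2))

# Keep sigma positive so the logarithm stays defined
res = minimize(neg_log_lik, x0=[5.0, 5.0], args=(sample,),
               method='L-BFGS-B', bounds=[(None, None), (1e-6, None)])
mu_hat, sigma_hat = res.x

# Closed-form MLE: sample mean and (biased, ddof=0) sample standard deviation
print(mu_hat, sample.mean())
print(sigma_hat, sample.std())
```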
I am aware that the following will require patience, and I appreciate the effort you put into it.
I have measured data that represent the derivative of the magnetic moment, dM/dH. A good mathematical model of the M(H) curve is the Langevin function:
M(H) = coth(xi) - 1/xi, where xi = cte*Vi³
so the derivative of the magnetic moment can be obtained from the derivative of the Langevin function:
dM/dH = 1/xi² - 1/(sinh²(xi))
For the fit I used this function as the fitting function:
def langevinDeriv(xx):
    if not hasattr(xx, '__iter__'):
        xx = [ xx ]
    res = np.zeros(len(xx))
    eps = 1e-1
    for i in range(len(xx)):
        x = xx[i]
        if np.fabs(x) < eps:
            # series expansion near zero, where the closed form is numerically unstable
            res[i] = 1./3. - x**2/15. + 2.* x**4 / 189. - x**6/675. + 2.* x**8 / 10395. - 1382. * x**10 / 58046625. + 4. * x**12 / 1403325.
        else:
            res[i] = (1./x**2 - 1./np.sinh(x)**2)
    return res
and minimized the error with a simple least-squares function.
Here is what I got: [figure: comparison of fit and data]
I would say that the fit is not good, because I don't actually have particles of a single diameter but polydisperse ensembles with different diameters, and hence different Langevin derivative functions.
My question is: how can I integrate this probability density for the diameter into my fitting function, so that the program fits a probability distribution and not a single diameter Vi? The probability density function is given here:
So I fiddled around a bit. As mentioned in the comments, the fit will never give great results, as the model does not capture the drop in signal at the ends (or the step-like behaviour in the graph). The results, however, look much better than with a simple Langevin derivative. I basically sum up functions with different particle volumes up to a maximum diameter. You can control the maximum diameter and the number of diameters used in the range from 0 to the maximum diameter. The only two fit parameters are the standard deviation and the overall amplitude. You have to be careful with the scaling to get physically meaningful results. I already played a little with n and d_max and found that, in my scaling, 15, 3 is OK. I guess d_max should be sufficiently larger than s, and n reasonably large, so that several values lie near the maximum of the log-normal distribution.
import matplotlib
from matplotlib import pyplot as plt
import numpy as np
from scipy.optimize import curve_fit ,leastsq
def log_gauss(x,s):
if x==0 or s==0:
if abs(exponent)>100:
out=np.exp(exponent)/np.sqrt(2 * np.pi * x**2 * s**2)
return out
def langevin(x,epsilon=1e-4):
if abs(x)<epsilon:
return out
def langevin_d(x,epsilon=1e-4):
if abs(x)<epsilon:
elif abs(x)>100.:
out= 1./x**2
return out
def langevin_d_distributed(h,s,n=25,dMax=10):
pdiaList=[log_gauss(d,s) for d in diaList]
volList=[d**3 for d in diaList]
for v,p in zip(volList,pdiaList):
return dm
def residuals(parameters,dataPoint,n=25,dMax=10):
a,s = abs(parameters)
dist = [y -a*langevin_d_distributed(x,s,n=n,dMax=dMax) for x,y in dataPoint]
return dist
meas_x,meas_y=np.loadtxt('OBaPH.txt', delimiter=',',unpack=True)
langevinDList=[langevin_d(h) for h in hList]
distList_01=[langevin_d_distributed(h,.29) for h in hList]
estimate = [1,0.29]
for nnn,ddd in [(15,3),(15,1.5),(15,10),(5,3),(25,3)]:
bestFitValues[(nnn,ddd)], ier = leastsq(residuals, estimate,args=(dataTupel,nnn,ddd))
print bestFitValues[(nnn,ddd)]
myFit[(nnn,ddd)]= [bestFitValues[(nnn,ddd)][0]*langevin_d_distributed(h,bestFitValues[(nnn,ddd)][1],n=nnn,dMax=ddd) for h in hList]
ax.plot(meas_x,meas_y,linestyle='',marker='o',label='rescaled data')
ax.plot(hList,distList_01,label='log_norm test')
for key,val in myFit.iteritems():
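The idea of summing Langevin derivatives over a distribution of diameters can be sketched independently of the listing above. This is a hedged illustration, not the answer's exact code: it assumes a log-normal diameter density with median 1 in rescaled units, normalized weights, and a field variable h already scaled so that xi = h * d**3. The names lognormal_pdf and langevin_d_distributed, and the default n and d_max, are chosen for this sketch only.

```python
import numpy as np

def langevin_d(x, eps=1e-4):
    # Derivative of the Langevin function, 1/x**2 - 1/sinh(x)**2,
    # with its limit 1/3 used close to zero
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.full_like(x, 1.0 / 3.0)
    big = np.abs(x) >= eps
    xb = x[big]
    with np.errstate(over='ignore'):
        # sinh overflows to inf for large arguments, which gives the correct limit 1/x**2
        out[big] = 1.0 / xb**2 - 1.0 / np.sinh(xb)**2
    return out

def lognormal_pdf(d, s):
    # Log-normal density with median 1 (assumption) and shape parameter s
    d = np.asarray(d, dtype=float)
    return np.exp(-np.log(d)**2 / (2.0 * s**2)) / (d * s * np.sqrt(2.0 * np.pi))

def langevin_d_distributed(h, s, n=200, d_max=5.0):
    # Weighted sum over diameters; each term is d/dh of L(h * v) = v * L'(h * v)
    d = np.linspace(d_max / n, d_max, n)   # start above zero to avoid log(0)
    w = lognormal_pdf(d, s)
    w /= w.sum()                           # normalize the discrete weights
    v = d**3                               # volume up to a constant factor
    return np.sum(w * v * langevin_d(h * v))

print(langevin_d_distributed(0.0, 0.3), langevin_d_distributed(1.0, 0.3))
```

In a fit, an overall amplitude multiplying langevin_d_distributed and the width s would be the free parameters, as described in the answer.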
I have a differential equation of the form
dy(x)/dx = f(y,x)
that I would like to solve for y.
I have an array xs containing all of the values of x for which I need ys.
For only those values of x, I can evaluate f(y,x) for any y.
How can I solve for ys, preferably in python?
import numpy as np
# these are the only x values that are legal
xs = np.array([0.15, 0.383, 0.99, 1.0001])
# some made up function --- I don't actually have an analytic form like this
def f(y, x):
    if not np.any(np.isclose(x, xs)):
        return np.nan
    return np.sin(y + x**2)
# now I want to know which array of ys satisfies dy(x)/dx = f(y,x)
Assuming you can use something simple like Forward Euler...
Numerical solutions will rely on approximate solutions at previous times. So if you want a solution at t = 1 it is likely you will need the approximate solution at t<1.
My advice is to figure out what step size will allow you to hit the times you need, and then find the approximate solution on an interval containing those times.
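import numpy as np

# From your example, the smallest step size required to hit all points would be 0.0001.
a = 0  # start point
b = 1.5  # possible end point
h = 0.0001
n = int(round(float(b-a)/h)) + 1  # number of grid points, including both end points
y = np.zeros(n)
t = np.linspace(a,b,n)
y[0] = 0.1  # initial condition here
for i in range(1,n):
    # f(y, x) as defined in the question: state first, then the independent variable
    y[i] = y[i-1] + h*f(y[i-1],t[i-1])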
Alternatively, you could use an adaptive step method (which I am not prepared to explain right now) to take larger steps between the times you need.
Or, you could find an approximate solution over an interval using a coarser mesh and interpolate the solution.
Any of these should work.
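As an illustration of the adaptive-step option, SciPy's solve_ivp can be asked to report the solution only at the required points via its t_eval argument. This is a sketch under assumptions: the right-hand side is taken to be evaluable at any x (which the stepping between the requested points requires), the initial condition y(0) = 0.1 is reused from the Euler example, and the tolerances are arbitrary. Note that solve_ivp expects the signature fun(x, y), with the independent variable first.

```python
import numpy as np
from scipy.integrate import solve_ivp

# The x values at which the solution is needed
xs = np.array([0.15, 0.383, 0.99, 1.0001])

def rhs(x, y):
    # Made-up right-hand side from the question, in solve_ivp's (x, y) argument order
    return np.sin(y + x**2)

# Integrate from x = 0 to the last requested point; output only at xs
sol = solve_ivp(rhs, (0.0, xs[-1]), [0.1], t_eval=xs, rtol=1e-8, atol=1e-10)
ys = sol.y[0]
print(ys)
```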
I think you should first solve the ODE on a regular grid and then interpolate the solution onto your fixed grid. Approximate code for your problem:
import numpy as np
from scipy.integrate import odeint
from scipy import interpolate
xs = np.array([0.15, 0.383, 0.99, 1.0001])
# dy/dx = f(y, x)
def dy_dx(y, x):
    return np.sin(y + x ** 2)

y0 = 0.0  # initial condition
x = np.linspace(0, 10, 200)  # here you can control the accuracy
sol = odeint(dy_dx, y0, x)
f = interpolate.interp1d(x, np.ravel(sol))
ys = f(xs)
But dy_dx(y, x) should always return something reasonable (not np.nan).
[figure: plot of the solution for this case]
I was trying to implement a radial basis function (RBF) model in Python and NumPy as described in the Caltech lecture. The mathematics seems clear to me, so I find it strange that it's not working (or seems not to work). The idea is simple: one chooses a subsampled number of centers for the Gaussians, forms a kernel matrix, and tries to find the best coefficients, i.e. solves Kc = y with least squares, where K is the Gaussian kernel (Gram) matrix. For that I did:
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X,Y=subsampled_data_points,squared=True))
#(C,_,_,_) = np.linalg.lstsq(K,Y_train)
C = np.dot( np.linalg.pinv(Kern), Y )
but when I try to plot my interpolation with the original data they don't look at all alike:
with 100 random centers (from the data set). I also tried 10 centers, which produces essentially the same graph, as does using every data point in the training set. I assumed that using every data point in the data set should more or less perfectly copy the curve (overfit), but it didn't. It produces:
which doesn't seem correct. I will provide the full code (that runs without error):
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from scipy.interpolate import Rbf
import matplotlib.pyplot as plt
## Data sets
def get_labels_improved(X,f):
    N_train = X.shape[0]
    Y = np.zeros( (N_train,1) )
    for i in range(N_train):
        Y[i] = f(X[i])
    return Y

def get_kernel_matrix(x,W,S):
    beta = get_beta_np(S)
    #beta = 0.5*tf.pow(tf.div( tf.constant(1.0,dtype=tf.float64),S), 2)
    Z = -beta*euclidean_distances(X=x,Y=W,squared=True)
    K = np.exp(Z)
    return K

N = 5000
low_x =-2*np.pi
high_x = 2*np.pi  # assumed upper bound, symmetric to low_x
X = low_x + (high_x - low_x) * np.random.rand(N,1)
# f(x) = 2*(2*cos(x)^2 - 1)^2 - 1
f = lambda x: 2*np.power( 2*np.power( np.cos(x) ,2) - 1, 2) - 1
Y = get_labels_improved(X , f)
K = 2 # number of centers for RBF
indices=np.random.choice(a=N,size=K) # choose numbers from 0 to D^(1)
subsampled_data_points=X[indices,:] # M_sub x D
stddev = 100
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X,Y=subsampled_data_points,squared=True))
#(C,_,_,_) = np.linalg.lstsq(K,Y_train)
C = np.dot( np.linalg.pinv(Kern), Y )
Y_pred = np.dot( Kern , C )
plt.plot(X, Y, 'o', label='Original data', markersize=1)
plt.plot(X, Y_pred, 'r', label='Fitted line', markersize=1)
Since the plots look strange, I decided to read the docs for the plotting functions, but I couldn't find anything obviously wrong.
Scaling of interpolating functions
The main problem is an unfortunate choice of standard deviation for the functions used for interpolation:
stddev = 100
The features of your function (its humps) are of size about 1. So, use
stddev = 1
Order of X values
The mess of red lines is there because plt.plot from matplotlib connects consecutive data points in the order given. Since your X values are in random order, this results in chaotic left-right movements. Use sorted X:
X = np.sort(low_x + (high_x - low_x) * np.random.rand(N,1), axis=0)
Efficiency issues
Your get_labels_improved function is inefficient, looping over the elements of X. Use Y = f(X), leaving the looping to low-level NumPy internals.
Also, the least-squares solution of an overdetermined system should be computed with lstsq instead of computing the pseudoinverse (computationally expensive) and multiplying by it.
Here is the cleaned-up code; using 30 centers gives a good fit.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
import matplotlib.pyplot as plt

N = 5000
low_x =-2*np.pi
high_x = 2*np.pi  # assumed upper bound, symmetric to low_x
X = np.sort(low_x + (high_x - low_x) * np.random.rand(N,1), axis=0)
f = lambda x: 2*np.power( 2*np.power( np.cos(x) ,2) - 1, 2) - 1
Y = f(X)
K = 30 # number of centers for RBF
indices=np.random.choice(a=N,size=K) # choose numbers from 0 to D^(1)
subsampled_data_points=X[indices,:] # M_sub x D
stddev = 1
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X, Y=subsampled_data_points,squared=True))
C = np.linalg.lstsq(Kern, Y, rcond=None)[0]
Y_pred = np.dot(Kern, C)
plt.plot(X, Y, 'o', label='Original data', markersize=1)
plt.plot(X, Y_pred, 'r', label='Fitted line', markersize=1)
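The remark about lstsq versus the pseudoinverse can be checked on a small example. The sketch below is NumPy-only and makes its own assumptions: 500 sample points, 10 evenly spaced centers (instead of randomly chosen ones), and stddev = 1, chosen so that the kernel matrix stays reasonably well conditioned. It only illustrates that both routes give the same least-squares prediction, not how good the fit is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sorted sample points on [-2*pi, 2*pi] and the target function from the question
X = np.sort(rng.uniform(-2 * np.pi, 2 * np.pi, size=(500, 1)), axis=0)
f = lambda x: 2 * np.power(2 * np.power(np.cos(x), 2) - 1, 2) - 1
Y = f(X)

# Evenly spaced centers (assumption for this sketch) and the Gaussian kernel matrix
centers = np.linspace(-2 * np.pi, 2 * np.pi, 10).reshape(-1, 1)
stddev = 1.0
beta = 0.5 / stddev**2
sq_dists = (X - centers.T) ** 2          # pairwise squared distances, shape (500, 10)
Kern = np.exp(-beta * sq_dists)

# Two ways to obtain the least-squares coefficients
C_lstsq = np.linalg.lstsq(Kern, Y, rcond=None)[0]
C_pinv = np.dot(np.linalg.pinv(Kern), Y)

# Largest difference between the two predictions
print(np.max(np.abs(np.dot(Kern, C_lstsq) - np.dot(Kern, C_pinv))))
```

For badly conditioned kernel matrices the two results can differ, because the routines apply different default cutoffs for small singular values.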
I've been trying to fit the amplitude, frequency and phase of a sine curve given some generated two dimensional toy data. (Code at the end)
To get estimates for the three parameters, I first perform an FFT. I use the values from the FFT as initial guesses for the actual frequency and phase and then fit for them (row by row). I wrote my code such that I input which bin of the FFT I want the frequency to be in, so I can check if the fitting is working well. But there's some pretty strange behaviour. If my input bin is say 3.1 (a non integral bin, so the FFT won't give me the right frequency) then the fit works wonderfully. But if the input bin is 3 (so the FFT outputs the exact frequency) then my fit fails, and I'm trying to understand why.
Here's the output when I give the input bins (in the X and Y direction) as 3.0 and 2.1 respectively:
(The plot on the right is data - fit)
Here's the output when I give the input bins as 3.0 and 2.0:
Question: Why does the non linear fit fail when I input the exact frequency of the curve?
#! /usr/bin/python

# For the purposes of this code, it's easier to think of the X-Y axes as transposed,
# so the X axis is vertical and the Y axis is horizontal

import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as optimize
import itertools
import sys

PI = np.pi

# Function which accepts parameters to define a sine curve
# Used for the non-linear fit
def sineFit(t, a, f, p):
    return a * np.sin(2.0 * PI * f*t + p)

xSize = 18
ySize = 60
npt = xSize * ySize

# Get frequency bin from user input
xFreq = float(sys.argv[1])
yFreq = float(sys.argv[2])

xPeriod = xSize/xFreq
yPeriod = ySize/yFreq

# arrays should be defined here

# Generate the 2D sine curve
for jj in range (0, xSize):
    for ii in range(0, ySize):
        sineGen[jj, ii] = np.cos(2.0*PI*(ii/xPeriod + jj/yPeriod))

# Compute 2dim FFT as well as freq bins along each axis
fftData = np.fft.fft2(sineGen)
fftMean = np.mean(fftData)
fftRMS = np.std(fftData)
xFreqArr = np.fft.fftfreq(fftData.shape[1]) # Frequency bins along x
yFreqArr = np.fft.fftfreq(fftData.shape[0]) # Frequency bins along y

# Find peak of FFT, and position of peak
maxVal = np.amax(np.abs(fftData))
maxPos = np.where(np.abs(fftData) == maxVal)

# Iterate through peaks in the FFT
# For this example, number of loops will always be only one

prevPhase = -1000
for col, row in itertools.izip(maxPos[0], maxPos[1]):

    # Initial guesses for fit parameters from FFT
    init_phase = np.angle(fftData[col,row])
    init_amp = 2.0 * maxVal/npt
    init_freqY = yFreqArr[col]
    init_freqX = xFreqArr[row]
    cntr = 0
    if prevPhase == -1000:
        prevPhase = init_phase

    guess = [init_amp, init_freqX, prevPhase]

    # Fit each row of the 2D sine curve independently
    for rr in sineGen:
        (amp, freq, phs), pcov = optimize.curve_fit(sineFit, xDat, rr, guess)
        # xDat is a linspace array, containing a list of numbers from 0 to xSize-1

        # Subtract fit from original data and plot
        fitData = sineFit(xDat, amp, freq, phs)
        sub1 = rr - fitData

        # Plot
        fig1 = plt.figure()
        ax1 = fig1.add_subplot(121)
        p1, = ax1.plot(rr, 'g')
        p2, = ax1.plot(fitData, 'b')
        plt.legend([p1,p2], ["data", "fit"])

        ax2 = fig1.add_subplot(122)
        p3, = ax2.plot(sub1)
        plt.legend([p3], ['residual1'])

        cntr += 1
        prevPhase = phs # Update guess for phase of sine curve
I've tried to distill the important parts of your question into this answer.
First of all, try fitting a single block of data, not an array. Once you are confident that your model is sufficient you can move on.
Your fit is only going to be as good as your model; if you move on to something not "sine"-like you'll need to adjust accordingly.
Fitting is an "art", in that the initial conditions can greatly change the convergence of the error function. In addition, there may be more than one minimum in your fits, so you often have to worry about the uniqueness of your proposed solution.
While you were on the right track with your FFT idea, I think your implementation wasn't quite correct. The code below should be a great toy system. It generates random data of the type f(x) = a0*sin(a1*x+a2). Sometimes a random initial guess will work, sometimes it will fail spectacularly. However, using the FFT guess for the frequency the convergence should always work for this system. An example output:
import numpy as np
import pylab as plt
import scipy.optimize as optimize

# This is your target function
def sineFit(t, params):
    a, f, p = params
    return a * np.sin(2.0*np.pi*f*t + p)

# This is our "error" function
def err_func(p0, X, Y, target_function):
    err = ((Y - target_function(X, p0))**2).sum()
    return err

# Try out different parameters, sometimes the random guess works
# sometimes it fails. The FFT solution should always work for this problem
inital_args = np.random.random(3)

X = np.linspace(0, 10, 1000)
Y = sineFit(X, inital_args)

# Use a random initial guess
inital_guess = np.random.random(3)

# Fit
sol = optimize.fmin(err_func, inital_guess, args=(X,Y,sineFit))

# Plot the fit
Y2 = sineFit(X, sol)
plt.title("Random Initial Guess: Final Parameters: %s"%sol)

# Use an improved "fft" guess for the frequency
# this will be the max in k-space
timestep = X[1]-X[0]
guess_k = np.argmax( np.abs(np.fft.rfft(Y)) )  # peak of the magnitude spectrum
guess_f = np.fft.fftfreq(X.size, timestep)[guess_k]
inital_guess[1] = guess_f

# Guess the amplitude by taking the max of the absolute values
inital_guess[0] = np.abs(Y).max()

sol = optimize.fmin(err_func, inital_guess, args=(X,Y,sineFit))
Y2 = sineFit(X, sol)
plt.title("FFT Guess : Final Parameters: %s"%sol)
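The earlier remark that a fit can have more than one minimum can be made concrete by scanning the squared error over the frequency alone. The sketch below assumes a noise-free sine with amplitude 1, frequency 0.05 and phase 0, sampled at 100 integer points; the scan range and grid are arbitrary choices for this illustration.

```python
import numpy as np

# Noise-free test signal with known parameters (assumed for this sketch)
x = np.arange(100)
f_true = 0.05
y = np.sin(2.0 * np.pi * f_true * x)

# Sum of squared errors as a function of the trial frequency only
freqs = np.linspace(0.01, 0.1, 901)
sse = np.array([np.sum((y - np.sin(2.0 * np.pi * f * x)) ** 2) for f in freqs])

# Interior grid points that are lower than both neighbours are local minima
is_min = (sse[1:-1] < sse[:-2]) & (sse[1:-1] < sse[2:])
local_min_freqs = freqs[1:-1][is_min]

best_freq = freqs[np.argmin(sse)]
print(best_freq, local_min_freqs)
```

A local optimizer started in the basin of one of the other minima has no reason to leave it, which is why a frequency guess taken from the FFT peak helps.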
The problem is due to a bad initial guess of the phase, not the frequency. While cycling through the rows of sineGen (inner loop), you use the fit result of the previous row as the initial guess for the next row, which does not always work. If you determine the phase from an FFT of the current row and use that as the initial guess, the fit will succeed.
You could change the inner loop as follows:
for n,rr in enumerate(sineGen):
    fftx = np.fft.fft(rr)
    fftx = fftx[:len(fftx)//2]  # keep the non-negative frequencies
    idx = np.argmax(np.abs(fftx))
    init_phase = np.angle(fftx[idx])
    print(fftx[idx], init_phase)
Also you need to change

def sineFit(t, a, f, p):
    return a * np.sin(2.0 * np.pi * f*t + p)

to

def sineFit(t, a, f, p):
    return a * np.cos(2.0 * np.pi * f*t + p)

since phase=0 means that the imaginary part of the FFT is zero and thus the function is cosine-like.
By the way, your sample above is still lacking definitions of sineGen and xDat.
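The cosine convention can be verified on a single row. For a cosine whose frequency falls exactly on an FFT bin, the angle of the peak coefficient equals the phase of the cosine. The sketch below assumes a row length of 60, bin 3 and phase 0.7; these numbers are illustrative.

```python
import numpy as np

n = 60                       # number of samples in one row (assumed)
t = np.arange(n)
k_true, phase_true = 3, 0.7  # integer frequency bin and phase of the test signal

# Cosine with exactly k_true periods over the row
signal = np.cos(2.0 * np.pi * k_true * t / n + phase_true)

# Keep the non-negative frequencies and locate the peak
spectrum = np.fft.fft(signal)
half = spectrum[:n // 2]
k = np.argmax(np.abs(half))

phase_est = np.angle(half[k])          # phase relative to a cosine
amp_est = 2.0 * np.abs(half[k]) / n    # amplitude of the real signal
print(k, phase_est, amp_est)
```

For a sine model the same peak would need a phase offset of pi/2, which is why the model function and the FFT-based phase guess have to use the same convention.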
Without understanding much of your code, according to
(amp2, freq2, phs2), pcov = optimize.curve_fit(sineFit, tDat,
sub1, guess2)
should become:
(amp2, freq2, phs2), pcov = optimize.curve_fit(sineFit, tDat,
sub1, p0=guess2)
Assuming that tDat and sub1 are x and y, that should do the trick. But, once again, it is quite difficult to understand such complex code with so many interlinked variables and no comments at all. Code should always be built from the bottom up, meaning that you don't run a loop of fits when a single one is not working, and you don't add noise until the code can fit the noise-free examples. Good luck!
By "nothing fancy" I meant something like removing EVERYTHING that is not related with the fit, and doing a simplified mock example such as:
import numpy as np
import scipy.optimize as optimize

def sineFit(t, a, f, p):
    return a * np.sin(2.0 * np.pi * f*t + p)

# Create array of x and y with given parameters
x = np.asarray(range(100))
y = sineFit(x, 1, 0.05, 0)

# Give a guess and fit, printing result of the fitted values
guess = [1., 0.05, 0.]
print(optimize.curve_fit(sineFit, x, y, guess)[0])
The result of this is exactly the answer:
[1. 0.05 0.]
But if you change guess not too much, just enough:
# Give a guess and fit, printing result of the fitted values
guess = [1., 0.06, 0.]
print(optimize.curve_fit(sineFit, x, y, guess)[0])
the result gives absurdly wrong numbers:
[ 0.00823701 0.06391323 -1.20382787]
Can you explain this behavior?
You can use curve_fit with a series of trigonometric functions, which is usually very robust and adjustable to the precision you need just by increasing the number of terms. Here is an example:
from numpy import sin, cos, linspace, pi
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def f(x, a0,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,
      c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12):
    return a0 + s1*sin(1*x) + c1*cos(1*x) \
        + s2*sin(2*x) + c2*cos(2*x) \
        + s3*sin(3*x) + c3*cos(3*x) \
        + s4*sin(4*x) + c4*cos(4*x) \
        + s5*sin(5*x) + c5*cos(5*x) \
        + s6*sin(6*x) + c6*cos(6*x) \
        + s7*sin(7*x) + c7*cos(7*x) \
        + s8*sin(8*x) + c8*cos(8*x) \
        + s9*sin(9*x) + c9*cos(9*x) \
        + s10*sin(10*x) + c10*cos(10*x) \
        + s11*sin(11*x) + c11*cos(11*x) \
        + s12*sin(12*x) + c12*cos(12*x)

# x and y are the data arrays to be fitted
norm_factor = pi/2. / (x.max() - x.min())
x_norm = x * norm_factor
popt, pcov = curve_fit(f, x_norm, y)
x_fit = linspace(x_norm.min(), x_norm.max(), 1000)
y_fit = f(x_fit, *popt)
plt.plot( x_fit/norm_factor, y_fit )  # undo the normalization for plotting
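The same kind of truncated trigonometric series can be written more compactly with a variable-length parameter list. The following is a sketch under assumptions: the helper name fourier_series, the number of terms N_TERMS, and the synthetic test signal are all made up for this illustration. Because the function takes *coeffs, curve_fit needs p0 to know how many parameters to fit.

```python
import numpy as np
from scipy.optimize import curve_fit

N_TERMS = 4  # number of sine/cosine pairs (assumption for this sketch)

def fourier_series(x, *coeffs):
    # coeffs layout: [a0, s1..sN, c1..cN]
    a0 = coeffs[0]
    s = coeffs[1:N_TERMS + 1]
    c = coeffs[N_TERMS + 1:]
    y = np.full_like(x, a0, dtype=float)
    for k in range(1, N_TERMS + 1):
        y += s[k - 1] * np.sin(k * x) + c[k - 1] * np.cos(k * x)
    return y

# Synthetic data with known coefficients: a0 = 1, s1 = 2, c3 = 0.5
x = np.linspace(0.0, 2.0 * np.pi, 200)
y = 1.0 + 2.0 * np.sin(x) + 0.5 * np.cos(3 * x)

# p0 fixes the number of fitted parameters (2 * N_TERMS + 1)
p0 = np.ones(2 * N_TERMS + 1)
popt, pcov = curve_fit(fourier_series, x, y, p0=p0)
print(popt)
```

Since the model is linear in its coefficients, the same fit could also be obtained with a linear least-squares solve on a matrix of sine and cosine columns.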