curve_fit does not fit properly. I'm trying to fit experimental data with curve_fit. The data is imported from a .txt file into an array:
d = np.loadtxt("data.txt")
data_x = np.array(d[:, 0])
data_y = np.array(d[:, 2])
data_y_err = np.array(d[:, 3])
Since I know there must be two peaks, my model is a sum of two Gaussian curves:
def model_dGauss(x, xc, A, y0, w, dx):
    P = A / (w * np.sqrt(2 * np.pi))
    mu1 = (x - (xc - dx/3)) / (2 * w**2)
    mu2 = (x - (xc + 2*dx/3)) / (2 * w**2)
    return y0 + P * (np.exp(-mu1**2) + 0.5 * np.exp(-mu2**2))
The fit is very sensitive to my guess values. What is the point of fitting data if only a nearly perfect set of starting parameters produces a result? Or am I doing something completely wrong?
t = np.linspace(8.4, 10, 300)
guess_dG = [32, 1, 10, 0.1, 0.2]
popt, pcov = curve_fit(model_dGauss, data_x, data_y, p0=guess_dG, sigma=data_y_err, absolute_sigma=True)
xc, A, y0, w, dx = popt
Plotting the data
plt.scatter(data_x, data_y)
plt.plot(t, model_dGauss(t, *popt))
plt.errorbar(data_x, data_y, yerr=data_y_err)
yields:
[plot of the fit result]
The result is just a straight line at the bottom of my graph, while the evaluated parameters are not that bad. How can that be?
A complete example of code is always appreciated (and, ahem, usually expected here on SO). To remove much of the confusion about using curve_fit here, allow me to suggest that you will have an easier time using lmfit (https://lmfit.github.io/lmfit-py), and especially its built-in model functions and its use of named parameters. With lmfit, your code for two Gaussians plus a constant offset might look like this:
from lmfit.models import GaussianModel, ConstantModel
# start with 1 Gaussian + Constant offset:
model = GaussianModel(prefix='p1_') + ConstantModel()
# this model will have parameters named:
# p1_amplitude, p1_center, p1_sigma, and c.
# here we give initial values to these parameters
params = model.make_params(p1_amplitude=10, p1_center=32, p1_sigma=0.5, c=10)
# optionally place bounds on parameters (probably not needed here):
params['p1_amplitude'].min = 0.
## params['p1_center'].vary = False # fix a parameter from varying in fit
# now do the fit (including weighting residual by 1/y_err):
result = model.fit(data_y, params, x=data_x, weights=1.0/data_y_err)
# print out param values, uncertainties, and fit statistics, or get best-fit
# parameters from `result.params`
print(result.fit_report())
# plot results
plt.errorbar(data_x, data_y, yerr=data_y_err, label='data')
plt.plot(data_x, result.best_fit, label='best fit')
plt.legend()
plt.show()
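For instance, something like this should pull a single best-fit value and its uncertainty out of result.params:

print('p1 center =', result.params['p1_center'].value,
      '+/-', result.params['p1_center'].stderr)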
To add a second Gaussian, you could just do
model = GaussianModel(prefix='p1_') + GaussianModel(prefix='p2_') + ConstantModel()
# and then:
params = model.make_params(p1_amplitude=10, p1_center=32, p1_sigma=0.5, c=10,
                           p2_amplitude=2, p2_center=31.75, p2_sigma=0.5)
and so on.
Your model has the two Gaussians sharing, or at least having "linked", values: the sigma values should be the same for the two peaks, and the amplitude of the 2nd is half that of the 1st. As defined so far, the 2-Gaussian model has all of its parameters being independent. But lmfit has a mechanism for setting constraints on any parameter by giving an algebraic expression in terms of other parameters. So, for example, you could say
params['p2_sigma'].expr = 'p1_sigma'
params['p2_amplitude'].expr = 'p1_amplitude / 2.0'
Now, p2_amplitude and p2_sigma will not be independently varied in the fit but will be constrained to have those values.
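Putting the pieces together, a minimal sketch of the constrained two-Gaussian fit (again assuming data_x, data_y, and data_y_err are defined as in the question) might look like:

from lmfit.models import GaussianModel, ConstantModel

model = GaussianModel(prefix='p1_') + GaussianModel(prefix='p2_') + ConstantModel()
params = model.make_params(p1_amplitude=10, p1_center=32, p1_sigma=0.5,
                           p2_amplitude=5, p2_center=31.75, p2_sigma=0.5, c=10)

# tie the second peak to the first, as described above
params['p2_sigma'].expr = 'p1_sigma'
params['p2_amplitude'].expr = 'p1_amplitude / 2.0'

result = model.fit(data_y, params, x=data_x, weights=1.0/data_y_err)
print(result.fit_report())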
I'm modeling measurement errors in a certain measuring device. This is how the data looks: high frequency sine ripples on a low frequency polynomial. My model should capture the ripples too.
The curve that fits the error should be of the form: error(x) = a0 + a1*x + a2*x^2 + ... + an*x^n + A*sin(x/lambda). The order n of the polynomial is not known. My plan is to iterate n from 1 to 9 and select the polynomial that has the highest F-value.
I've played with numpy.polyfit and scipy.optimize.curve_fit so far. numpy.polyfit is only for polynomials, so while I can generate the "best fit" polynomial, there's no way to determine the parameters A and lambda for the sine term. scipy.optimize.curve_fit would have worked great if I already knew the order of the polynomial for the polynomial part of error(x).
Is there a clever way to use both numpy.polyfit and scipy.optimize.curve_fit to get this done? Or another library-function perhaps?
Here's the code for how I'm using numpy.polyfit to select the best polynomial:
def GetErrorPolynomial(X, Y):
    maxFval = 0.0
    for i in range(1, 10):  # i is the order of the polynomial (max order = 9)
        error_func = np.polyfit(X, Y, i)
        error_func = np.poly1d(error_func)

        # F-test (looking for the largest F value)
        numerator = np.sum(np.square(error_func(X) - np.mean(Y))) / i
        denominator = np.sum(np.square(Y - error_func(X))) / (Y.size - i - 1)
        Fval = numerator / denominator

        if Fval > maxFval:
            maxFval = Fval
            maxFvalPolynomial = error_func

    return maxFvalPolynomial
And here's the code for how I'm using curve_fit:
def poly_sine_fit(x, a, b, c, d, l):
    return a*np.square(x) + b*x + c + d*np.sin(x/l)

param, _ = curve_fit(poly_sine_fit, x_data, y_data)
It's "hardcoded" to a quadratic function, but I want to select the "best" order as I'm doing above with np.polyfit
I finally found a way to model the ripples and can answer my own question. This 2006 paper does curve fitting on ripples that resemble my dataset.
First, I did a least-squares polynomial fit and then subtracted this polynomial curve from the original data. This left me with only the ripples. Applying the Fourier transform, I picked out the dominant frequencies, which let me reconstruct the sine ripples. Then I simply added these ripples back to the polynomial curve I had obtained in the beginning. That did it.
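A rough sketch of that procedure, assuming evenly spaced x values and hypothetical x, y arrays (this is not the exact code used), could look like:

import numpy as np

def poly_plus_dominant_ripple(x, y, order=3):
    # 1) least-squares polynomial fit
    poly = np.poly1d(np.polyfit(x, y, order))
    # 2) subtract the polynomial to leave only the ripples
    residual = y - poly(x)
    # 3) pick the dominant frequency from the Fourier transform of the residual
    spectrum = np.fft.rfft(residual)
    freqs = np.fft.rfftfreq(len(x), d=x[1] - x[0])   # assumes evenly spaced x
    k = np.argmax(np.abs(spectrum[1:])) + 1          # skip the DC term
    amp = 2.0 * np.abs(spectrum[k]) / len(x)
    phase = np.angle(spectrum[k])
    ripple = amp * np.cos(2 * np.pi * freqs[k] * (x - x[0]) + phase)
    # 4) add the reconstructed ripple back onto the polynomial
    return poly(x) + ripple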
Use Scikit-learn Linear Regression
Here is a code sample I used to perform a linear regression with a polynomial of degree 3 that passes through the point x = 0 with value 1 and zero derivative. You just have to adapt the function create_vector to the function you want.
from sklearn import linear_model
import numpy as np
def create_vector(x):
    # currently representing a polynomial Y = a*X^3 + b*X^2
    x3 = np.power(x, 3)
    x2 = np.power(x, 2)
    X = np.append(x3, x2, axis=1)
    return X
data_x = [some_data_input]
data_y = [some_data_output]
x = np.array(data_x).reshape(-1, 1)
y_data = np.array(data_y).reshape(-1, 1)-1 # -1 to pass by the point (0,1)
X = create_vector(x)
regr = linear_model.LinearRegression(fit_intercept=False)
regr.fit(X, y_data)
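For example (hypothetically, continuing the snippet above), the fitted coefficients and the predictions on the original scale could then be read back like this:

a, b = regr.coef_[0]                         # coefficients of X^3 and X^2
y_fit = regr.predict(create_vector(x)) + 1   # add back the 1 subtracted earlier
print('a =', a, 'b =', b)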
I extracted data from the scatterplot for analysis and found that a polynomial + sine did not seem to be an optimal model, because lower order polynomials were not following the shape of the data very well and higher order polynomials were exhibiting Runge's phenomenon of high curvature at the data extremes. I performed an equation search to find what the high-frequency sine wave might be imposed upon, and a good candidate seemed to be the Extreme Value peak equation "a * exp(-1.0 * exp(-1.0 * ((x-b)/c))-((x-b)/c) + 1.0) + offset" as shown below.
Here is a graphical Python curve fitter for this equation. At the top of the file I load the data I had extracted, so you would need to replace this with your actual data. This fitter uses scipy's differential_evolution genetic algorithm module to estimate initial parameter values for the non-linear fitter; differential_evolution uses the Latin Hypercube algorithm to ensure a thorough search of parameter space and requires bounds within which to search. Here those bounds are taken from the data maximum and minimum values.
Subtracting the predictions of this fitted curve from the data should leave you with only the sine component to be modeled (see the short follow-up after the code below). I noted that there seems to be an additional narrow, low-amplitude peak at approximately x = 275.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings
##########################################################
# load data section
f = open('/home/zunzun/temp/temp.dat')
textData = f.read()
f.close()
xData = []
yData = []
for line in textData.split('\n'):
    if line:  # ignore blank lines
        spl = line.split()
        xData.append(float(spl[0]))
        yData.append(float(spl[1]))
xData = numpy.array(xData)
yData = numpy.array(yData)
##########################################################
# model to be fitted
def func(x, a, b, c, offset):  # Extreme Value Peak equation from zunzun.com
    return a * numpy.exp(-1.0 * numpy.exp(-1.0 * ((x-b)/c)) - ((x-b)/c) + 1.0) + offset
##########################################################
# fitting section
# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
    warnings.filterwarnings("ignore")  # do not print warnings by genetic algorithm
    val = func(xData, *parameterTuple)
    return numpy.sum((yData - val) ** 2.0)
def generate_Initial_Parameters():
    # min and max used for bounds
    maxX = max(xData)
    minX = min(xData)
    maxY = max(yData)
    minY = min(yData)
    minData = min(minX, minY)
    maxData = max(maxX, maxY)

    parameterBounds = []
    parameterBounds.append([minData, maxData])  # search bounds for a
    parameterBounds.append([minData, maxData])  # search bounds for b
    parameterBounds.append([minData, maxData])  # search bounds for c
    parameterBounds.append([minY, maxY])        # search bounds for offset

    # "seed" the numpy random number generator for repeatable results
    result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x
# by default, differential_evolution polishes its result with a bounded local
# minimizer (scipy.optimize.minimize with L-BFGS-B) using the parameter bounds
geneticParameters = generate_Initial_Parameters()
# now call curve_fit without passing bounds from the genetic algorithm,
# just in case the best-fit parameters are outside those bounds
fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)
print('Fitted parameters:', fittedParameters)
print()
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print()
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)

    # first the raw data as a scatter plot
    axes.plot(xData, yData, 'D')

    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData))
    yModel = func(xModel, *fittedParameters)

    # now the model as a line plot
    axes.plot(xModel, yModel)

    axes.set_xlabel('X Data')  # X axis data label
    axes.set_ylabel('Y Data')  # Y axis data label

    plt.show()
    plt.close('all')  # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
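As noted above, the fitted peak can then be removed to isolate the ripples; a small follow-up sketch using the names defined in this script:

# residual after removing the fitted peak - this should contain mostly the
# sine ripples that remain to be modeled
sineComponent = yData - func(xData, *fittedParameters)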
UPDATE -------
If the high-frequency sine component is constant (which I do not know), then modeling a small portion of the data with only a few cycles should be sufficient to determine the equation and initial parameter estimates for fitting the sine-wave portion of the model. Here I have done this, with the following equation and fitted parameter values:

y = amplitude * sin(pi * (x - center) / width) + Offset

amplitude = -1.0362957093184177E+00
center = 3.6632754608370377E+01
width = 5.0813421718648293E+00
Offset = 5.1940843481496088E+00
pi = 3.14159265358979323846 # constant, not fitted
Combining these two models using the actual data, rather than my scatterplot-extracted data, should be close to what you need.
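One possible way to combine them (a sketch only, using the parameter values above as starting guesses and the names from the script; not fitted against your real data) would be a single model passed to curve_fit:

def combinedFunc(x, a, b, c, offset, amplitude, center, width):
    peak = a * numpy.exp(-1.0 * numpy.exp(-1.0 * ((x-b)/c)) - ((x-b)/c) + 1.0) + offset
    ripple = amplitude * numpy.sin(numpy.pi * (x - center) / width)
    return peak + ripple

initialGuess = list(fittedParameters) + [-1.036, 36.63, 5.081]
combinedParameters, _ = curve_fit(combinedFunc, xData, yData, p0=initialGuess)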
I am trying to fit a mixture-of-Gaussians model and am facing several problems. I have pasted my whole code at ideone: https://ideone.com/dNYtZ2
When I try to run fitMixGauss(data, k), I get a singular matrix error from the function below.
def fitMixGauss(data, k):
    """
    Estimate a k-component MoG model that would fit the data. Incrementally plots the outcome.

    Keyword arguments:
    data -- d by n matrix containing data points.
    k -- scalar representing the number of gaussians to use in the MoG model.

    Returns:
    mixGaussEst -- dict containing the estimated MoG parameters.
    """
    # MAIN E-M ROUTINE
    # In the E-M algorithm, we calculate a complete posterior distribution over
    # the (nData) hidden variables in the E-Step.
    # In the M-Step, we update the parameters of the Gaussians (mean, cov, w).
    nDims, nData = data.shape
    postHidden = np.zeros(shape=(k, nData))

    # we will initialize the values to random values
    mixGaussEst = dict()
    mixGaussEst['d'] = nDims
    mixGaussEst['k'] = k
    mixGaussEst['weight'] = (1 / k) * np.ones(shape=(k))
    mixGaussEst['mean'] = 2 * np.random.randn(nDims, k)
    mixGaussEst['cov'] = np.zeros(shape=(nDims, nDims, k))
    for cGauss in range(k):
        mixGaussEst['cov'][:, :, cGauss] = 2.5 + 1.5 * np.random.uniform() * np.eye(nDims)

    # calculate current likelihood
    # TO DO - fill in this routine
    logLike = getMixGaussLogLike(data, mixGaussEst)
    print('Log Likelihood Iter 0 : {:4.3f}\n'.format(logLike))

    nIter = 30
    logLikeVec = np.zeros(shape=(2 * nIter))
    boundVec = np.zeros(shape=(2 * nIter))

    fig, ax = plt.subplots(1, 1)

    for cIter in range(nIter):
        # ===================== =====================
        # Expectation step
        # ===================== =====================
        curCov = mixGaussEst['cov']
        curWeight = mixGaussEst['weight']
        curMean = mixGaussEst['mean']
        num = np.zeros(shape=(k, nData))
        for cData in range(nData):
            # TO DO (g): fill in column of 'hidden' - calculate posterior probability that
            # this data point came from each of the Gaussians
            # for c in range(k):
            #     num[c] = mixGaussEst['weight'][c] * (1/((2*np.pi)**(nDims)*np.linalg.det(mixGaussEst['cov'][:,:,c]))**(1/2))*np.exp(-0.5*(np.transpose(thisData-mixGaussEst['mean'][:,c])))  # np.linalg.inv(mixGaussEst['cov'][:,:,c]) # (thisData-mixGaussEst['mean'][:,c])
            thisData = data[:, cData]
            denominatorExp = 0
            for j in range(k):
                mu = curMean[:, j]
                sigma = curCov[:, :, j]
                curNorm = (1 / ((2 * np.pi)**(nDims) * np.linalg.det(sigma))**(1 / 2)) * np.exp(-0.5 * (np.transpose(thisData - mu)))  # np.linalg.inv(sigma) # (mu)
                num[j, cData] = curWeight[j] * curNorm
                denominatorExp = denominatorExp + num[j, cData]
            postHidden[:, cData] = num[:, cData] / denominatorExp

        # ===================== =====================
        # Maximization Step
        # ===================== =====================
        # for each constituent Gaussian
        for cGauss in range(k):
            # TO DO (h): Update weighting parameters mixGauss.weight based on the total
            # posterior probability associated with each Gaussian.
            sum_Kth_Gauss_Resp = np.sum(postHidden[cGauss, :])
            mixGaussEst['weight'][cGauss] = sum_Kth_Gauss_Resp / np.sum(postHidden)

            # TO DO (i): Update mean parameters mixGauss.mean by weighted average,
            # where weights are given by posterior probability associated with Gaussian.
            numerator = 0
            for j in range(nData):
                numerator = numerator + postHidden[cGauss, j] * data[:, j]
            numerator = np.dot(postHidden[cGauss, :], data[0, :])
            mixGaussEst['mean'][:, cGauss] = numerator / sum_Kth_Gauss_Resp

            # TO DO (j): Update covariance parameter based on weighted average of
            # square distance from the updated mean, where weights are given by
            # posterior probability associated with Gaussian.
            muMatrix = mixGaussEst['mean'][:, cGauss]
            muMatrix = muMatrix.reshape((2, 1))
            numerator = 0
            for j in range(nData):
                kk = data[:, j]
                kk.reshape((2, 1))
                numerator_i = postHidden[cGauss, j] * (kk - muMatrix)  # np.transpose(kk - muMatrix)
                numerator = numerator + numerator_i
            mixGaussEst['cov'][:, :, cGauss] = numerator / sum_Kth_Gauss_Resp

        # draw the new solution
        drawEMData2d(data, mixGaussEst)
        time.sleep(0.7)
        fig.canvas.draw()

        # calculate the log likelihood
        logLike = getMixGaussLogLike(data, mixGaussEst)
        print('Log Likelihood After Iter {} : {:4.3f}\n'.format(cIter, logLike))

    return mixGaussEst
The log likelihood gives nan. Why does the whole code (on ideone) end with a singular matrix error?
How would one allow one parameter of a function to vary between groups while the other parameters are fit across all of the groups?
I am using lmfit to fit a model for disease spread. I would like the exponents of the function to be fit across all data points, but the scaling factor needs to vary between different groups (to act as a proxy for different strains of the disease with varying reproduction rates in different years).
See my code below:
#### create parameters ####
params = Parameters()
params.add('tau_1', value=1.,min=0.01)
params.add('tau_2', value=1.,min=0.01)
params.add('rho_1', value=1.,min=0.01)
for i in range(0, len(sorted(variable_data.flu_season.unique()))):
    season = str(sorted(variable_data.flu_season.unique())[i])
    params.add('theta_' + season, value=1., min=0.01)
#### model ####
def pop_dist_inverse_grav(params, dist, host_pop, targ_pop, flu_sea, data):
    parvals = params.valuesdict()
    tau_1 = parvals['tau_1']
    tau_2 = parvals['tau_2']
    rho_1 = parvals['rho_1']
    theta_1 = parvals['theta_' + str(season)]

    grav_model = theta_1 * (np.power(dist, rho_1)) / ((np.power(host_pop, tau_1)) * (np.power(targ_pop, tau_2)))

    return grav_model - data
Following M Newville's advice I got the fitting to work. Code below:
import pandas as pd
from lmfit import minimize, Parameters, report_fit
import numpy as np
variable_data = pd.read_csv("../Data/variables_for_model_fit.csv", sep=",", header='infer')
season_list = sorted(variable_data.flu_season.unique())
#### create parameters ####
params1 = Parameters()
params1.add('tau_host', value=0.24,min=0, max = 3)
params1.add('tau_targ', value=0.14,min=0, max = 3)
params1.add('rho', value=0.29,min=0, max = 3)
# "global" parameters
for i, season in enumerate(season_list):
    params1.add('theta_%d' % season_list[i], value=1000., min=1, max=1e6)
    # creates theta parameters for each season
#### define model ####
def grav_dist_over_pop(dist, host_pop, targ_pop, theta, rho, tau_host, tau_targ):
    return theta**(-1) * dist**rho * host_pop**(-tau_host) * targ_pop**(-tau_targ)
#### objective function ####
def objective_1(params, dist, host_pop, targ_pop, flu_sea, data):
    parvals1 = params1.valuesdict()
    resid = np.zeros((len(season_data), len(season_data)))
    for i, data in enumerate(data):
        theta = parvals1['theta_%d' % flu_sea[i]]
        rho = parvals1['rho']            # _%d' % flu_sea[i]]
        tau_host = parvals1['tau_host']  # _%d' % flu_sea[i]]
        tau_targ = parvals1['tau_targ']  # _%d' % flu_sea[i]]
        model = grav_dist_over_pop(dist, host_pop, targ_pop, theta, rho, tau_host, tau_targ)
        resid[i, :] = model - data
    return resid.flatten()
#### fit global variables ####
season_data = variable_data.sample(500)
# my dataset is huge so lmfit takes an age when fitting all of the
# theta values
host_pop = np.asarray(season_data.host_city_pop.values.tolist())
targ_pop = np.asarray(season_data.target_city_pop.values.tolist())
dist = np.asarray(season_data.distance.values.tolist())
data = np.asarray(season_data.time_to_spread.values.tolist())
flu_sea = np.asarray(season_data.flu_season.values.tolist())
result = minimize(objective_1, params1, args=(dist, host_pop, targ_pop, flu_sea, data))
report_fit(result.params)
A fit with lmfit (or scipy.optimize and all other tools I am aware of) is always a "global" fit, finding optimal values for a single set of parameters that are used to minimize a 1D residual array. To be clear, you can optimize parameters to fit multiple data sets (or groups or seasons), but you have to reduce the problem to a single set of parameters calculating a 1D residual array.
For your problem, I would recommend starting with defining the "grav model" that models the data for a single season or dataset. I know nothing about this sort of field (but glad that lmfit might be useful!), but from your example this might look like this (if not, please correct)
def grav_season(dist, host_pop, targ_pop, theta, rho, tau_host, tau_targ):
    return theta * dist**rho * host_pop**(-tau_host) * targ_pop**(-tau_targ)
which appears to me to have 3 independent variables (dist, host_pop, targ_pop) and 4 parameters that might be variables: theta, rho, tau_host, tau_targ. Again, please correct if needed, but for the purposes here these details are not so important.
To fit a single season / dataset, you could do
from lmfit import Parameters, minimize, report_fit

def objective(params, data, dist, host_pop, targ_pop):
    pars = params.valuesdict()
    model = grav_season(dist, host_pop, targ_pop, pars['theta'],
                        pars['rho'], pars['tau_host'], pars['tau_targ'])
    return (model - data)

params = Parameters()
params.add('theta', value=1., min=0.01)
params.add('rho', value=1., min=0.01)
params.add('tau_host', value=1., min=0.01)
params.add('tau_targ', value=1., min=0.01)

result = minimize(objective, params, args=(data, dist, host_pop, targ_pop))
report_fit(result.params)
Now, for multiple seasons, you could simply make a theta, rho, tau_host, and tau_targ parameter for each season. But then there isn't much point in using all that data together in a single fit. If I understand correctly, you would like tau_host, tau_targ, and rho to have the same value for all seasons.
To do this, create Parameters for the globally applied exponents:
params = Parameters()
params.add('rho', value=1., min=0.01)
params.add('tau_host', value=1., min=0.01)
params.add('tau_targ', value=1., min=0.01)
and per-season parameters:
for i, season in enumerate(sorted(variable_data.flu_season.unique())):
    params.add('theta_%d' % i, value=1., min=0.01)
    params.add('rho_%d' % i, expr='rho')
    params.add('tau_host_%d' % i, expr='tau_host')
    params.add('tau_targ_%d' % i, expr='tau_targ')
Note that theta_i will vary independently, but rho_i, etc. will be constrained to take the value of the variable rho, etc. This gives a full set of parameters per season, but ones meeting your constraints. This approach also allows you to easily change one or more of these to vary separately if that looks like it might require more testing. To do that, you could just say:
params['rho_7'].set(value=2.0, vary=True, expr='') # vary independently
To use this set of multi-season parameters, you will also need multi-season data. I'm not sure if dist, host_pop and targ_pop are meant to vary with season, or only data. I'll assume that only data changes with season (but it should be easy enough to change if not). Construct a list with data for each season, and then modify your objective function would look something like:
def objective(params, season_data, dist, host_pop, targ_pop):
    pars = params.valuesdict()
    resid = np.zeros((len(season_data), len(season_data[0])))
    for i, data in enumerate(season_data):
        theta = pars['theta_%d' % i]
        rho = pars['rho_%d' % i]
        tau_host = pars['tau_host_%d' % i]
        tau_targ = pars['tau_targ_%d' % i]
        model = grav_season(dist, host_pop, targ_pop,
                            theta, rho, tau_host, tau_targ)
        resid[i, :] = model - data
    return resid.flatten()
result = minimize(objective, params, args=(season_data, dist, host_pop, targ_pop))
report_fit(result.params)
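As a small, purely illustrative follow-up, the per-season scaling factors can then be read out of result.params:

for name, par in result.params.items():
    if name.startswith('theta_'):
        print(name, par.value, par.stderr)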
Hope that helps get you started. Again, the main recommendations are:
- create a per-season / per-dataset model function for the phenomenon you are modeling.
- use 'expr' to constrain per-season parameters to take "global" values. Though you want all seasons to have the same value now, it might turn out that some parameters vary linearly with season, or are required to add to some value, or something else that means they aren't really independent; 'expr' constraints can express that too (a hypothetical sketch follows this list).
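For instance, a hypothetical variation on the per-season loop above, letting rho drift linearly with the season index while still fitting only a single slope parameter:

params.add('rho_slope', value=0.0)
for i, season in enumerate(sorted(variable_data.flu_season.unique())):
    params.add('rho_%d' % i, expr='rho + rho_slope * %d' % i)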
I am using PyMC to fit some data to a straight line. The data have outliers, so I adapted some code (third example at the link) written by Jake Vanderplas for his textbook. The method uses a vector variable qi to encode whether each individual data point belongs to the foreground model (which we are fitting to the line) or the background model, which we don't care about.
class lin_fit_ol(object):
    '''
    Fit a straight line to one independent variable
    (`xi`, with zero errors) and one dependent variable
    (`yi`, with possibly heteroscedastic errors `dyi`)

    Outliers in `yi` are permitted

    Intended to be a complement to a straight-line fit, for model
    testing purposes

    Modified from Vanderplas's code
    (found at http://www.astroml.org/book_figures/chapter8/fig_outlier_rejection.html)
    '''
    def __init__(self, xi, yi, dyi, value):
        self.xi, self.yi, self.dyi, self.value = xi, yi, dyi, value

        @pymc.stochastic
        def beta(value=np.array([0.5, 1.0])):
            """Slope and intercept parameters for a straight line.
            The likelihood corresponds to the prior probability of the parameters."""
            slope, intercept = value
            prob_intercept = 1 + 0 * intercept
            # uniform prior on theta = arctan(slope)
            # d[arctan(x)]/dx = 1 / (1 + x^2)
            prob_slope = np.log(1. / (1. + slope ** 2))
            return prob_intercept + prob_slope

        @pymc.deterministic
        def model(xi=xi, beta=beta):
            slope, intercept = beta
            return slope * xi + intercept

        # uniform prior on Pb, the fraction of bad points
        Pb = pymc.Uniform('Pb', 0, 1.0, value=0.1)

        # uniform prior on Yb, the centroid of the outlier distribution
        Yb = pymc.Uniform('Yb', -10000, 10000, value=0)

        # uniform prior on log(sigmab), the spread of the outlier distribution
        log_sigmab = pymc.Uniform('log_sigmab', -10, 10, value=5)

        # qi is Bernoulli distributed
        # Note: this syntax requires pymc version 2.2
        qi = pymc.Bernoulli('qi', p=1 - Pb, value=np.ones(len(xi)))

        @pymc.deterministic
        def sigmab(log_sigmab=log_sigmab):
            return np.exp(log_sigmab)
        def outlier_likelihood(yi, mu, dyi, qi, Yb, sigmab):
            """likelihood for full outlier posterior"""
            Vi = dyi ** 2
            Vb = sigmab ** 2

            root2pi = np.sqrt(2 * np.pi)

            logL_in = -0.5 * np.sum(
                qi * (np.log(2 * np.pi * Vi) + (yi - mu) ** 2 / Vi))

            logL_out = -0.5 * np.sum(
                (1 - qi) * (np.log(2 * np.pi * (Vi + Vb)) +
                            (yi - Yb) ** 2 / (Vi + Vb)))

            return logL_out + logL_in

        OutlierNormal = pymc.stochastic_from_dist(
            'outliernormal', logp=outlier_likelihood, dtype=np.float,
            mv=True)

        y_outlier = OutlierNormal(
            'y_outlier', mu=model, dyi=dyi, Yb=Yb, sigmab=sigmab, qi=qi,
            observed=True, value=yi)

        self.M = dict(y_outlier=y_outlier, beta=beta, model=model,
                      qi=qi, Pb=Pb, Yb=Yb, log_sigmab=log_sigmab,
                      sigmab=sigmab)

        self.sample_invoked = False
    def sample(self, iter, burn, calc_deviance=True):
        self.S0 = pymc.MCMC(self.M)
        self.S0.sample(iter=iter, burn=burn)
        self.trace = self.S0.trace('beta')

        self.btrace = self.trace[:, 0]
        self.mtrace = self.trace[:, 1]

        self.sample_invoked = True

    def triangle(self):
        assert self.sample_invoked == True, \
            'Must sample first! Use sample(iter, burn)'

        corner(self.trace[:], labels=['$m$', '$b$'])

    def plot(self, xlab='$x$', ylab='$y$'):
        # plot the data points
        plt.errorbar(self.xi, self.yi, yerr=self.dyi, fmt='.k')

        # do some shimmying to get quantile bounds
        xa = np.linspace(self.xi.min(), self.xi.max(), 100)
        A = np.vander(xa, 2)
        # generate all possible lines
        lines = np.dot(self.trace[:], A.T)
        quantiles = np.percentile(lines, [16, 84], axis=0)
        plt.fill_between(xa, quantiles[0], quantiles[1],
                         color="#8d44ad", alpha=0.5)

        # plot circles around points identified as outliers
        qi = self.S0.trace('qi')[:]
        Pi = qi.astype(float).mean(0)
        outlier_x = self.xi[Pi < 0.32]
        outlier_y = self.yi[Pi < 0.32]
        plt.scatter(outlier_x, outlier_y, lw=1, s=400, alpha=0.5,
                    facecolors='none', edgecolors='red')

        plt.xlabel(xlab)
        plt.ylabel(ylab)

    def ICs(self):
        self.MAP = pymc.MAP(self.M)
        self.MAP.fit()

        self.BIC = self.MAP.BIC
        self.AIC = self.MAP.AIC
        self.logp = self.MAP.logp
        self.logp_at_max = self.MAP.logp_at_max
        return self.AIC, self.BIC
So, when we calculate the BIC and AIC using this model, we get very large values (since there are lots of points). This makes total sense. However, this disfavors having many data points, which irks me. Plus, the large AIC and BIC would make a casual observer believe that the other model (which fits poorly as a result of the outliers) is actually the better model.
Am I missing a subtlety of the BIC and AIC here, or is it a harsh reality of using mixture models that you always have to use a bunch of extra binary parameters to denote the membership of your data points?
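For reference, with the usual textbook definitions (up to additive constants and scalings, which vary between references),

AIC = 2*k - 2*ln(L_max)
BIC = k*ln(n) - 2*ln(L_max)

so giving each of the n data points its own Bernoulli membership parameter q_i makes the parameter count k grow with n, and the BIC penalty k*ln(n) then grows roughly like n*ln(n).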
I recommend the book "An Introduction to Statistical Learning".
On page 212 you will find the formulas for AIC and BIC. In each of these formulas the sample number appears in the denominator, so the result should not be influenced by the number of samples, at least not in such an obvious way.