AIC & BIC of PyMC mixture model - python

I am using PyMC to fit some data to a straight line. The data have outliers, so I adapted some code (third example at the link) written by Jake Vanderplas for his textbook. The method uses a vector variable qi to encode whether each individual data point belongs to the foreground model (which we are fitting to the line) or the background model, which we don't care about.
# Imports assumed by the code below (not shown in the original post)
import numpy as np
import matplotlib.pyplot as plt
import pymc  # PyMC 2.x API
from corner import corner
class lin_fit_ol(object):
    '''
    fit a straight line to one independent variable
    (`xi`, with zero errors) and one dependent variable
    (`yi`, with possibly heteroscedastic errors `dyi`)
    Outliers in `yi` are permitted
    Intended to be a complement to a straight-line fit, for model
    testing purposes
    Modified from Vanderplas's code
    (found at http://www.astroml.org/book_figures/chapter8/fig_outlier_rejection.html)
    '''
    def __init__(self, xi, yi, dyi, value):
        self.xi, self.yi, self.dyi, self.value = xi, yi, dyi, value
        @pymc.stochastic
        def beta(value=np.array([0.5, 1.0])):
            """Slope and intercept parameters for a straight line.
            The likelihood corresponds to the prior probability of the parameters."""
            slope, intercept = value
            prob_intercept = 1 + 0 * intercept
            # uniform prior on theta = arctan(slope)
            # d[arctan(x)]/dx = 1 / (1 + x^2)
            prob_slope = np.log(1. / (1. + slope ** 2))
            return prob_intercept + prob_slope
        @pymc.deterministic
        def model(xi=xi, beta=beta):
            slope, intercept = beta
            return slope * xi + intercept
        # uniform prior on Pb, the fraction of bad points
        Pb = pymc.Uniform('Pb', 0, 1.0, value=0.1)
        # uniform prior on Yb, the centroid of the outlier distribution
        Yb = pymc.Uniform('Yb', -10000, 10000, value=0)
        # uniform prior on log(sigmab), the spread of the outlier distribution
        log_sigmab = pymc.Uniform('log_sigmab', -10, 10, value=5)
        # qi is bernoulli distributed
        # Note: this syntax requires pymc version 2.2
        qi = pymc.Bernoulli('qi', p=1 - Pb, value=np.ones(len(xi)))
        @pymc.deterministic
        def sigmab(log_sigmab=log_sigmab):
            return np.exp(log_sigmab)
        def outlier_likelihood(yi, mu, dyi, qi, Yb, sigmab):
            """likelihood for full outlier posterior"""
            Vi = dyi ** 2
            Vb = sigmab ** 2
            root2pi = np.sqrt(2 * np.pi)
            logL_in = -0.5 * np.sum(
                qi * (np.log(2 * np.pi * Vi) + (yi - mu) ** 2 / Vi))
            logL_out = -0.5 * np.sum(
                (1 - qi) * (np.log(2 * np.pi * (Vi + Vb)) +
                            (yi - Yb) ** 2 / (Vi + Vb)))
            return logL_out + logL_in
        OutlierNormal = pymc.stochastic_from_dist(
            'outliernormal', logp=outlier_likelihood, dtype=float,
            mv=True)
        y_outlier = OutlierNormal(
            'y_outlier', mu=model, dyi=dyi, Yb=Yb, sigmab=sigmab, qi=qi,
            observed=True, value=yi)
        self.M = dict(y_outlier=y_outlier, beta=beta, model=model,
                      qi=qi, Pb=Pb, Yb=Yb, log_sigmab=log_sigmab,
                      sigmab=sigmab)
        self.sample_invoked = False
    def sample(self, iter, burn, calc_deviance=True):
        self.S0 = pymc.MCMC(self.M)
        self.S0.sample(iter=iter, burn=burn)
        self.trace = self.S0.trace('beta')
        self.btrace = self.trace[:, 0]
        self.mtrace = self.trace[:, 1]
        self.sample_invoked = True
    def triangle(self):
        assert self.sample_invoked == True, \
            'Must sample first! Use sample(iter, burn)'
        corner(self.trace[:], labels=['$m$', '$b$'])
    def plot(self, xlab='$x$', ylab='$y$'):
        # plot the data points
        plt.errorbar(self.xi, self.yi, yerr=self.dyi, fmt='.k')
        # do some shimmying to get quantile bounds
        xa = np.linspace(self.xi.min(), self.xi.max(), 100)
        A = np.vander(xa, 2)
        # generate all possible lines
        lines = np.dot(self.trace[:], A.T)
        quantiles = np.percentile(lines, [16, 84], axis=0)
        plt.fill_between(xa, quantiles[0], quantiles[1],
                         color="#8d44ad", alpha=0.5)
        # plot circles around points identified as outliers
        qi = self.S0.trace('qi')[:]
        Pi = qi.astype(float).mean(0)
        outlier_x = self.xi[Pi < 0.32]
        outlier_y = self.yi[Pi < 0.32]
        plt.scatter(outlier_x, outlier_y, lw=1, s=400, alpha=0.5,
                    facecolors='none', edgecolors='red')
        plt.xlabel(xlab)
        plt.ylabel(ylab)
    def ICs(self):
        self.MAP = pymc.MAP(self.M)
        self.MAP.fit()
        self.BIC = self.MAP.BIC
        self.AIC = self.MAP.AIC
        self.logp = self.MAP.logp
        self.logp_at_max = self.MAP.logp_at_max
        return self.AIC, self.BIC
So, when we calculate the BIC and AIC using this model, we get very large values (since there are lots of points). This makes total sense. However, this disfavors having many data points, which irks me. Plus, the large AIC and BIC would make a casual observer believe that the other model (which fits poorly as a result of the outliers) is actually the better model.
Am I missing a subtlety of the BIC and AIC here, or is it a harsh reality of using mixture models that you always have to use a bunch of extra binary parameters to denote the membership of your data points?

I recommend the book "An Introduction to Statistical Learning".
On page 212 you will find the formulas for AIC and BIC. In each of these formulas the number of samples is in the denominator. Hence, the result should not be influenced by the number of samples, at least not in that obvious way.
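For comparison, here is a minimal sketch of the general likelihood-based definitions, AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. The helper names `aic` and `bic` and all numbers are made up for illustration, and the parameter count for the mixture model assumes that every `qi` is counted as one free parameter, which may not match how a given library counts them.

```python
import numpy as np

def aic(log_likelihood, k):
    """Akaike information criterion for k free parameters."""
    return 2.0 * k - 2.0 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion for k free parameters and n data points."""
    return k * np.log(n) - 2.0 * log_likelihood

n = 100            # number of data points (illustrative)
log_l = -250.0     # maximized log-likelihood (illustrative, same for both models)
k_line = 2                 # slope and intercept
k_mixture = 5 + n          # slope, intercept, Pb, Yb, log_sigmab, plus one qi per point (assumption)

print("line:   ", aic(log_l, k_line), bic(log_l, k_line, n))
print("mixture:", aic(log_l, k_mixture), bic(log_l, k_mixture, n))
```

At equal log-likelihood, the per-point indicators add a penalty that grows with the number of data points, which is the effect the question describes.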

Related

Trying to fit a trig function to data with scipy

I am trying to fit some data using scipy.optimize.curve_fit. I have read the documentation and also this StackOverflow post, but neither seems to answer my question.
I have some simple 2D data which looks approximately like a trig function. I want to fit it with a general trig function using scipy.
My approach is as follows:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# Load the data
data = np.loadtxt('example_data.txt')
t = data[:,0]
y = data[:,1]
# Define the function to fit
def func_cos(t,A,omega,dphi,C):
    # A is the amplitude, omega the frequency, dphi and C the horizontal/vertical shifts
    return A*np.cos(omega*t + dphi) + C
# Do a scipy fit
popt, pcov = curve_fit(func_cos, t,y)
# Plot fit data and original data
fig = plt.figure(figsize=(14,10))
ax1 = plt.subplot2grid((1,1), (0,0))
ax1.plot(t,y)
ax1.plot(t,func_cos(t,*popt))
This outputs:
where blue is the data and orange is the fit. Clearly I am doing something wrong. Any pointers?
If no values are provided for initial guess of the parameters p0 then a value of 1 is assumed for each of them. From the docs:
p0 : array_like, optional
Initial guess for the parameters (length N). If None, then the initial values will all be 1 (if the number of parameters for the function can be determined using introspection, otherwise a ValueError is raised).
Since your data has very large x-values and very small y-values an initial guess of 1 is far from the actual solution and hence the optimizer does not converge. You can help the optimizer by providing suitable initial parameter values that can be guessed / approximated from the data:
Amplitude: A = (y.max() - y.min()) / 2
Offset: C = (y.max() + y.min()) / 2
Frequency: Here we can estimate the number of zero crossings by multiplying consecutive y-values and checking which products are smaller than zero. This number divided by the total x-range gives the frequency, and to get it in units of pi we multiply that number by pi: y_shifted = y - offset; omega = np.pi * np.sum(y_shifted[:-1] * y_shifted[1:] < 0) / (t.max() - t.min())
Phase shift: can be set to zero, dphi = 0
So in summary, the following initial parameter guess can be used:
offset = (y.max() + y.min()) / 2
y_shifted = y - offset
p0 = (
(y.max() - y.min()) / 2,
np.pi * np.sum(y_shifted[:-1] * y_shifted[1:] < 0) / (t.max() - t.min()),
0,
offset
)
popt, pcov = curve_fit(func_cos, t, y, p0=p0)
Which gives me the following fit function:
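As a self-contained check of the initial-guess recipe above, the sketch below applies it to synthetic data. The helper name `guess_p0` and the "true" parameters are made up for illustration; the asker's `example_data.txt` is not used, so this only shows that the heuristic works on a clean cosine with small amplitude.

```python
import numpy as np
from scipy.optimize import curve_fit

def func_cos(t, A, omega, dphi, C):
    return A * np.cos(omega * t + dphi) + C

def guess_p0(t, y):
    """Data-driven initial guess: amplitude, angular frequency, phase, offset."""
    offset = (y.max() + y.min()) / 2
    y_shifted = y - offset
    # a sign change between consecutive samples marks one zero crossing
    n_crossings = np.sum(y_shifted[:-1] * y_shifted[1:] < 0)
    omega = np.pi * n_crossings / (t.max() - t.min())
    return ((y.max() - y.min()) / 2, omega, 0.0, offset)

# Synthetic data (illustrative values, not the asker's data)
rng = np.random.default_rng(0)
t = np.linspace(0.0, 100.0, 500)
y = func_cos(t, 0.02, 0.5, 0.3, 0.1) + rng.normal(0.0, 1e-4, t.size)

p0 = guess_p0(t, y)
popt, pcov = curve_fit(func_cos, t, y, p0=p0)
print("initial guess:", p0)
print("fitted:       ", popt)
```

The zero-crossing count is sensitive to noise near the crossings, so for noisier data it may help to smooth `y` before estimating the frequency.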

Multivariate curve-fitting in python for estimating the parameter and order of ellipse-like shapes

I'm trying to find the best parameters (a, b, and c) of the following function (general formula of circle, ellipse, or rhombus):
(|x|/a)^c + (|y|/b)^c = 1
of two arrays of independent data (x and y) in python. My main objective is to estimate the best value of (a, b, and c) based on my x and y variable. I am using curve_fit function from scipy, so here is my code with a demo x, and y.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
alpha = 5
beta = 3
N = 500
DIM = 2
np.random.seed(2)
theta = np.random.uniform(0, 2*np.pi, (N,1))
eps_noise = 0.2 * np.random.normal(size=[N,1])
circle = np.hstack([np.cos(theta), np.sin(theta)])
B = np.random.randint(-3, 3, (DIM, DIM))
noisy_ellipse = circle.dot(B) + eps_noise
X = noisy_ellipse[:,0:1]
Y = noisy_ellipse[:,1:]
def func(xdata, a, b,c):
    x, y = xdata
    return (np.abs(x)/a)**c + (np.abs(y)/b)**c
xdata = np.transpose(np.hstack((X, Y)))
ydata = np.ones((xdata.shape[1],))
pp, pcov = curve_fit(func, xdata, ydata, maxfev = 1000000, bounds=((0, 0, 1), (50, 50, 2)))
plt.scatter(X, Y, label='Data Points')
x_coord = np.linspace(-5,5,300)
y_coord = np.linspace(-5,5,300)
X_coord, Y_coord = np.meshgrid(x_coord, y_coord)
Z_coord = func((X_coord,Y_coord),pp[0],pp[1],pp[2])
plt.contour(X_coord, Y_coord, Z_coord, levels=[1], colors=('g'), linewidths=2)
plt.legend()
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
By using this code, the parameters are [4.69949891, 3.65493859, 1.0] for a, b, and c.
The problem is that I usually get a value of c at the lower end of its bounds, whereas for this demo data c should be very close to 2, since the data represent an ellipse.
Any help and suggestions for solving this issue are appreciated.
A curve whose equation is (|x/a|)^c + (|y/b|)^c = 1 is called a "superellipse":
http://mathworld.wolfram.com/Superellipse.html
For large c the superellipse tends to a rectangular shape.
For c=2 the curve is an ellipse, or a circle in the particular case a=b.
For c close to 1 the superellipse tends to a rhombus shape.
For c larger than 0 and lower than 1 the superellipse looks like a (squashed) astroid with sharp vertices. This kind of shape will not be considered below.
Before addressing the OP's actual question, it is of interest to study how regression behaves when fitting a superellipse to scattered data. A short, simplified experimental approach helps to understand the mathematical difficulty before the programming difficulties.
When the scatter increases, the computed value of c (corresponding to the minimum of the MSE) decreases. The minimum also becomes more and more difficult to locate, which is certainly a difficulty for software.
For even larger scatter the fit reaches c=1, which gives a rhombus shape.
So it is not surprising that, for the highly scattered example published by the OP, the software returned a rhombus as the fitted curve.
If this was not the expected result, one has to choose a goal other than the minimum MSE. For example, if the goal is to obtain an elliptic shape, one has to set c=2. The result in the next figure shows that the MSE is worse than with the preceding rhombus shape, but the elliptic fit is achieved.
NOTE: In case of large scatter the result depends heavily on the choice of fitting criterion (MSE, MAE, ..., and with respect to which variable). This can cause very different results from one software package to another if their fitting criteria (sometimes not explicit) differ.
If the rhombus shape is to be excluded, one has to define a more representative criterion and/or model and implement it in the software.
IMPORTANCE OF THE FITTING CRITERION:
To show how important the choice of fitting criterion is, especially for highly scattered data, we repeat the study with a different criterion.
Instead of the preceding criterion, which was the MSE of the errors on the superellipse equation itself, that was:
we choose a different criterion, for example the MSE of the errors on the radial coordinate in the polar system:
The notations are defined in the next picture:
Some results from the empirical study for increasing scatter:
We observe that the numerical calculation with the second criterion is more robust than with the first. Cases with higher scatter can be treated with the second fitting criterion.
The drawback is that this second criterion is probably not available in existing software. So one has to implement the above formulas in the existing software if possible, or write software specially adapted to it.
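The formulas themselves were posted as images and are not reproduced here, so the following is only one plausible reading of the two criteria, sketched with scipy.optimize.least_squares. It assumes the first criterion uses the residual (|x|/a)^c + (|y|/b)^c - 1, and the second uses r - rho(theta), where rho(theta) = (|cos(theta)/a|^c + |sin(theta)/b|^c)^(-1/c) is the radius of the superellipse in the direction of each data point. The function names, bounds, starting point and synthetic data are all assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

def implicit_residuals(params, x, y):
    """Errors on the superellipse equation itself (first criterion, as assumed here)."""
    a, b, c = params
    return (np.abs(x) / a) ** c + (np.abs(y) / b) ** c - 1.0

def radial_residuals(params, x, y):
    """Errors on the polar radius of each point (second criterion, as assumed here)."""
    a, b, c = params
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    rho = (np.abs(np.cos(theta) / a) ** c
           + np.abs(np.sin(theta) / b) ** c) ** (-1.0 / c)
    return r - rho

# Synthetic, mildly scattered ellipse (a=5, b=3, c=2); values are illustrative
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0 * np.pi, 400)
x = 5.0 * np.cos(t) + rng.normal(0.0, 0.05, t.size)
y = 3.0 * np.sin(t) + rng.normal(0.0, 0.05, t.size)

bounds = ([0.1, 0.1, 0.5], [50.0, 50.0, 10.0])
fit_radial = least_squares(radial_residuals, x0=(4.0, 4.0, 1.5),
                           args=(x, y), bounds=bounds)
fit_implicit = least_squares(implicit_residuals, x0=(4.0, 4.0, 1.5),
                             args=(x, y), bounds=bounds)
print("radial criterion:  ", fit_radial.x)
print("implicit criterion:", fit_implicit.x)
```

Increasing the noise level in this sketch is a simple way to compare how the two criteria degrade with scatter.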
Nevertheless, this discussion of fitting criteria is somewhat off topic, because the fitting criterion should not result from mathematical considerations alone. If the problem comes from a practical need in physics or technology, the criterion may be dictated by the real situation, leaving no choice.
I have modified your code (though you took it from https://stackoverflow.com/a/47881806/10640534) quite a lot, but I think I have what you expect. I am using a different equation, which I found here. I have also used the new NumPy random generators, but I believe that is only cosmetic for this problem. I am drawing the ellipse using patches from matplotlib, which is also cosmetic, but a much better way to represent your conic. Importantly, I am using the dogbox method for curve_fit because the other methods do not converge. Occasionally the ellipse is not matched, and decreasing the added noise helps (e.g., rng.normal(0, 1, (500, 2)) / 1e2 instead of rng.normal(0, 1, (500, 2)) / 1e1). Anyway, snippet and figure below.
import numpy as np
from numpy.random import default_rng
from matplotlib.patches import Ellipse
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def func(data, a, b, h, k, A):
    x, y = data
    return ((((x - h) * np.cos(A) + (y - k) * np.sin(A)) / a) ** 2
            + (((x - h) * np.sin(A) - (y - k) * np.cos(A)) / b) ** 2)
rng = default_rng(3)
numPoints = 500
center = rng.random(2) * 10 - 5
theta = rng.uniform(0, 2 * np.pi, (numPoints, 1))
circle = np.hstack([np.cos(theta), np.sin(theta)])
ellipse = (circle.dot(rng.random((2, 2)) * 2 * np.pi - np.pi)
+ (center[0], center[1]) + rng.normal(0, 1, (500, 2)) / 1e1)
pp, pcov = curve_fit(func, (ellipse[:, 0], ellipse[:, 1]), np.ones(numPoints),
p0=(1, 1, center[0], center[1], np.pi / 2),
method='dogbox')
plt.scatter(ellipse[:, 0], ellipse[:, 1], label='Data Points')
plt.gca().add_patch(Ellipse(xy=(pp[2], pp[3]), width=2 * pp[0],
height=2 * pp[1], angle=pp[4] * 180 / np.pi,
fill=False))
plt.gca().set_aspect('equal')
plt.tight_layout()
plt.show()
To incorporate the value of the exponent, I have used your equation and generated an ellipse according to this answer. This results in:
import numpy as np
from numpy.random import default_rng
from matplotlib.patches import Ellipse
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit, root
from scipy.special import ellipeinc
def angles_in_ellipse(num, a, b):
    assert(num > 0)
    assert(a < b)
    angles = 2 * np.pi * np.arange(num) / num
    if a != b:
        e = (1.0 - a ** 2.0 / b ** 2.0) ** 0.5
        tot_size = ellipeinc(2.0 * np.pi, e)
        arc_size = tot_size / num
        arcs = np.arange(num) * arc_size
        res = root(lambda x: (ellipeinc(x, e) - arcs), angles)
        angles = res.x
    return angles
def func(data, a, b, c):
    x, y = data
    return (np.absolute(x) / a) ** c + (np.absolute(y) / b) ** c
a = 10
b = 20
n = 100
phi = angles_in_ellipse(n, a, b)
e = (1.0 - a ** 2.0 / b ** 2.0) ** 0.5
arcs = ellipeinc(phi, e)
noise = default_rng(0).normal(0, 1, n) / 2
pp, pcov = curve_fit(func, (b * np.sin(phi) + noise,
a * np.cos(phi) + noise),
np.ones(n), method='lm')
plt.scatter(b * np.sin(phi) + noise, a * np.cos(phi) + noise,
label='Data Points')
plt.gca().add_patch(Ellipse(xy=(0, 0), width=2 * pp[0], height=2 * pp[1],
angle=0, fill=False))
plt.gca().set_aspect('equal')
plt.tight_layout()
plt.show()
As you decrease noise values, pp will tend to (b, a, 2).

Curve fitting with nth order polynomial having sine ripples

I'm modeling measurement errors in a certain measuring device. This is how the data looks: high frequency sine ripples on a low frequency polynomial. My model should capture the ripples too.
The curve that fits the error should be of the form: error(x) = a0 + a1*x + a2*x^2 + ... + an*x^n + A*sin(x/lambda). The order n of the polynomial is not known. My plan is to iterate n from 1 to 9 and select the one that has the highest F-value.
I've played with numpy.polyfit and scipy.optimize.curve_fit so far. numpy.polyfit is only for polynomials, so while I can generate the "best fit" polynomial, there's no way to determine the parameters A and lambda for the sine term. scipy.optimize.curve_fit would have worked great if I already knew the order of the polynomial for the polynomial part of error(x).
Is there a clever way to use both numpy.polyfit and scipy.optimize.curve_fit to get this done? Or another library-function perhaps?
Here's the code for how I'm using numpy.polyfit to select the best polynomial:
def GetErrorPolynomial(X, Y):
    maxFval = 0.0
    for i in range(1, 10): # i is the order of the polynomial (max order = 9)
        error_func = np.polyfit(X, Y, i)
        error_func = np.poly1d(error_func)
        # F-test (looking for the largest F value)
        numerator = np.sum(np.square(error_func(X) - np.mean(Y))) / i
        denominator = np.sum(np.square(Y - error_func(X))) / (Y.size - i - 1)
        Fval = numerator / denominator
        if Fval > maxFval:
            maxFval = Fval
            maxFvalPolynomial = error_func
    return maxFvalPolynomial
And here's the code for how I'm using curve_fit:
def poly_sine_fit(x, a, b, c, d, l):
    return a*np.square(x) + b*x + c + d*np.sin(x/l)
param, _ = curve_fit(poly_sine_fit, x_data, y_data)
It's "hardcoded" to a quadratic function, but I want to select the "best" order as I'm doing above with np.polyfit.
I finally found a way to model the ripples and can answer my own question. This 2006 paper does curve-fitting on ripples that resemble my dataset.
First off, I did a least squares polynomial fit and then subtracted this polynomial curve from the original data. This left me with only the ripples. Applying the Fourier transform, I picked out the dominant frequencies which let me reconstruct the sine ripples. Then I simply added these ripples to the polynomial curve I had obtained in the beginning. That did it.
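The steps described above can be sketched roughly as follows. This is not the code from the answer or the paper: the function name `fit_poly_plus_ripple` is hypothetical, the polynomial order is assumed to be known, the samples are assumed to be uniformly spaced in x, and only the single strongest Fourier component is kept.

```python
import numpy as np

def fit_poly_plus_ripple(x, y, order):
    """Polynomial trend plus the dominant sinusoid of the residual.

    Assumes x is uniformly spaced. Returns the model values and the
    dominant ripple frequency in cycles per unit of x.
    """
    coeffs = np.polyfit(x, y, order)          # least-squares polynomial
    trend = np.polyval(coeffs, x)
    residual = y - trend                      # ideally only the ripples remain
    spectrum = np.fft.rfft(residual)
    freqs = np.fft.rfftfreq(x.size, d=x[1] - x[0])
    k = np.argmax(np.abs(spectrum[1:])) + 1   # strongest bin, skipping the DC term
    filtered = np.zeros_like(spectrum)
    filtered[k] = spectrum[k]                 # keep only the dominant frequency
    ripple = np.fft.irfft(filtered, n=x.size)
    return trend + ripple, freqs[k]

# Synthetic example (illustrative values): quadratic trend plus a 3-cycles-per-unit ripple
x = np.linspace(0.0, 10.0, 1000, endpoint=False)
y = 0.5 + 0.2 * x - 0.03 * x**2 + 0.1 * np.sin(2.0 * np.pi * 3.0 * x)
model, ripple_freq = fit_poly_plus_ripple(x, y, order=2)
print("dominant ripple frequency:", ripple_freq)
print("max abs model error:", np.max(np.abs(model - y)))
```

If the ripple frequency does not fall on an FFT bin, the dominant bin is only an estimate; it could then serve as an initial guess for a nonlinear fit of the sine term.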
Use Scikit-learn Linear Regression
Here is a code sample I used to perform a linear regression with a polynomial of degree 3 that passes through the point (0, 1) with zero derivative there. You just have to adapt the function create_vector to the function you want.
from sklearn import linear_model
import numpy as np
def create_vector(x):
    # currently representing a polynomial Y = a*X^3 + b*X^2
    x3 = np.power(x, 3)
    x2 = np.power(x, 2)
    X = np.append(x3, x2, axis=1)
    return X
data_x = [some_data_input]
data_y = [some_data_output]
x = np.array(data_x).reshape(-1, 1)
y_data = np.array(data_y).reshape(-1, 1)-1 # -1 to pass by the point (0,1)
X = create_vector(x)
regr = linear_model.LinearRegression(fit_intercept=False)
regr.fit(X, y_data)
I extracted data from the scatterplot for analysis and found that a polynomial + sine did not seem to be an optimal model, because lower order polynomials were not following the shape of the data very well and higher order polynomials were exhibiting Runge's phenomenon of high curvature at the data extremes. I performed an equation search to find what the high-frequency sine wave might be imposed upon, and a good candidate seemed to be the Extreme Value peak equation "a * exp(-1.0 * exp(-1.0 * ((x-b)/c))-((x-b)/c) + 1.0) + offset" as shown below.
Here is a graphical Python curve fitter for this equation. At the top of the file I load the data I had extracted, so you would need to replace this with the actual data. This fitter uses scipy's differential_evolution genetic algorithm module to estimate initial parameter values for the non-linear fitter. That module uses the Latin Hypercube algorithm to ensure a thorough search of parameter space and requires bounds within which to search. Here those bounds are taken from the data maximum and minimum values.
Subtracting the model predictions from this fitted curve should leave you with only the sine component to be modeled. I noted that there seems to be an additional narrow, low-amplitude peak at approximately x = 275.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings
##########################################################
# load data section
f = open('/home/zunzun/temp/temp.dat')
textData = f.read()
f.close()
xData = []
yData = []
for line in textData.split('\n'):
    if line: # ignore blank lines
        spl = line.split()
        xData.append(float(spl[0]))
        yData.append(float(spl[1]))
xData = numpy.array(xData)
yData = numpy.array(yData)
##########################################################
# model to be fitted
def func(x, a, b, c, offset): # Extreme Value Peak equation from zunzun.com
    return a * numpy.exp(-1.0 * numpy.exp(-1.0 * ((x-b)/c))-((x-b)/c) + 1.0) + offset
##########################################################
# fitting section
# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
    warnings.filterwarnings("ignore") # do not print warnings by genetic algorithm
    val = func(xData, *parameterTuple)
    return numpy.sum((yData - val) ** 2.0)
def generate_Initial_Parameters():
    # min and max used for bounds
    maxX = max(xData)
    minX = min(xData)
    maxY = max(yData)
    minY = min(yData)
    minData = min(minX, minY)
    maxData = max(maxX, maxY)
    parameterBounds = []
    parameterBounds.append([minData, maxData]) # search bounds for a
    parameterBounds.append([minData, maxData]) # search bounds for b
    parameterBounds.append([minData, maxData]) # search bounds for c
    parameterBounds.append([minY, maxY]) # search bounds for offset
    # "seed" the numpy random number generator for repeatable results
    result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x
# by default, differential_evolution finishes by polishing its best result
# with a bounded local minimizer (L-BFGS-B), staying inside the parameter bounds
geneticParameters = generate_Initial_Parameters()
# now call curve_fit without passing bounds from the genetic algorithm,
# just in case the best fit parameters are outside those bounds
fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)
print('Fitted parameters:', fittedParameters)
print()
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print()
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)
    # first the raw data as a scatter plot
    axes.plot(xData, yData, 'D')
    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData))
    yModel = func(xModel, *fittedParameters)
    # now the model as a line plot
    axes.plot(xModel, yModel)
    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label
    plt.show()
    plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
UPDATE -------
If the high-frequency sine component is constant (which I do not know) then modeling a small portion of the data with only a few cycles will be sufficient to determine the equation and initial parameter estimates for fitting the sine wave portion of the model. Here I have done this with the following result:
from the following equation:
amplitude = -1.0362957093184177E+00
center = 3.6632754608370377E+01
width = 5.0813421718648293E+00
Offset = 5.1940843481496088E+00
pi = 3.14159265358979323846 # constant not fitted
y = amplitude * sin(pi * (x - center) / width) + Offset
Combining these two models using the actual data, rather than my scatterplot-extracted data, should be close to what you need.

Why isn't this centered fourth-order-accurate finite differencing scheme yielding fourth-order convergence for solving pdes

I am solving the dissipation equation using a finite differencing scheme. The initial condition is a half sine wave with Dirichlet boundary conditions on both sides. I insert an extra point on each side of the domain to enforce the Dirichlet boundary condition while maintaining fourth-order accuracy, then use forward Euler to evolve it in time.
When I switch from the second-order-accurate stencil, (u[i+1] - 2*u[i] + u[i-1]) / dx^2, to the fourth-order-accurate stencil, (-u[i+2] + 16*u[i+1] - 30*u[i] + 16*u[i-1] - u[i-2]) / (12*dx^2), I do not see an improvement in the rate of convergence when I plot dx against an estimate of the error.
I wrote and commented a code that shows my problem. When I use the 5-point strategy, my rate of convergence is the same:
Why is this happening? Why isn't the fourth-order-accurate stencil helping the convergence rate? I combed over this carefully and I think that there must be some issue in my understanding.
# Let's evolve the diffusion equation in time with Dirichlet BCs
# Load modules
import numpy as np
import matplotlib.pyplot as plt
# Domain size
XF = 1
# Viscosity
nu = 0.01
# Spatial differentiation function, approximates nu * d^2u/dx^2
def diffusive_dudt(un, nu, dx, strategy='5c'):
    undiff = np.zeros(un.size, dtype=np.float128)
    # O(h^2)
    if strategy == '3c':
        undiff[2:-2] = nu * (un[3:-1] - 2 * un[2:-2] + un[1:-3]) / dx**2
    # O(h^4)
    elif strategy == '5c':
        undiff[2:-2] = nu * (-1 * un[4:] + 16 * un[3:-1] - 30 * un[2:-2] + 16 * un[1:-3] - un[:-4]) / (12 * dx**2 )
    else: raise(IOError("Invalid diffusive strategy")) ; quit()
    return undiff
def geturec(x, nu=.05, evolution_time=1, u0=None, n_save_t=50, ubl=0., ubr=0., diffstrategy='5c', dt=None, returndt=False):
    dx = x[1] - x[0]
    # Prescribe cfl=0.1 and ftcs=0.2
    if dt is not None: pass
    else: dt = min(.1 * dx / 1., .2 / nu * dx ** 2)
    if returndt: return dt
    nt = int(evolution_time / dt)
    divider = int(nt / float(n_save_t))
    if divider ==0: raise(IOError("not enough time steps to save %i times"%n_save_t))
    # The default initial condition is a half sine wave.
    u_initial = ubl + np.sin(x * np.pi)
    if u0 is not None: u_initial = u0
    u = u_initial
    u[0] = ubl
    u[-1] = ubr
    # insert ghost cells; extra cells on the left and right
    # for the edge cases of the finite difference scheme
    x = np.insert(x, 0, x[0]-dx)
    x = np.insert(x, -1, x[-1]+dx)
    u = np.insert(u, 0, ubl)
    u = np.insert(u, -1, ubr)
    # u_record holds all the snapshots. They are evenly spaced in time,
    # except the final and initial states
    u_record = np.zeros((x.size, int(nt / divider + 2)))
    # Evolve through time
    ii = 1
    u_record[:, 0] = u
    for _ in range(nt):
        un = u.copy()
        dudt = diffusive_dudt(un, nu, dx, strategy=diffstrategy)
        # forward euler time step
        u = un + dt * dudt
        # Save every xth time step
        if _ % divider == 0:
            #print "C # ---> ", u * dt / dx
            u_record[:, ii] = u.copy()
            ii += 1
    u_record[:, -1] = u
    return u_record[1:-1, :]
# define L-1 Norm
def ul1(u, dx): return np.sum(np.abs(u)) / u.size
# Now let's sweep through dxs to find convergence rate
# Define dxs to sweep
xrang = np.linspace(350, 400, 4)
# this function accepts a differentiation key name and returns a list of dx and L-1 points
def errf(strat):
    # Lists to record dx and L-1 points
    ypoints = []
    dxs= []
    # Establish truth value with a more-resolved grid
    x = np.linspace(0, XF, 800) ; dx = x[1] - x[0]
    # Record truth L-1 and dt associated with finest "truth" grid
    trueu = geturec(nu=nu, x=x, diffstrategy=strat, evolution_time=2, n_save_t=20, ubl=0, ubr=0)
    truedt = geturec(nu=nu, x=x, diffstrategy=strat, evolution_time=2, n_save_t=20, ubl=0, ubr=0, returndt=True)
    trueqoi = ul1(trueu[:, -1], dx)
    # Sweep dxs
    for nx in xrang:
        # linspace needs an integer number of points
        x = np.linspace(0, XF, int(nx)) ; dx = x[1] - x[0]
        dxs.append(dx)
        # Run solver, hold dt fixed
        u = geturec(nu=nu, x=x, diffstrategy='5c', evolution_time=2, n_save_t=20, ubl=0, ubr=0, dt=truedt)
        # record |L-1(dx) - L-1(truth)|
        qoi = ul1(u[:, -1], dx)
        ypoints.append(np.abs(trueqoi - qoi))
    return dxs, ypoints
# Plot results. The fourth order method should have a slope of 4 on the log-log plot.
from scipy.optimize import minimize as mini
strategy = '5c'
dxs, ypoints = errf(strategy)
def fit2(a): return 1000 * np.sum((a * np.array(dxs) ** 2 - ypoints) ** 2)
def fit4(a): return 1000 * np.sum((np.exp(a) * np.array(dxs) ** 4 - ypoints) ** 2)
a = mini(fit2, 500).x
b = mini(fit4, 11).x
plt.plot(dxs, a * np.array(dxs)**2, c='k', label=r"$\nu^2$", ls='--')
plt.plot(dxs, np.exp(b) * np.array(dxs)**4, c='k', label=r"$\nu^4$")
plt.plot(dxs, ypoints, label=r"Convergence", marker='x')
plt.yscale('log')
plt.xscale('log')
plt.xlabel(r"$\Delta X$")
plt.ylabel("$L-L_{true}$")
plt.title(r"$\nu=%f, strategy=%s$"%(nu, strategy))
plt.legend()
plt.savefig('/Users/kilojoules/Downloads/%s.pdf'%strategy, bbox_inches='tight')
The error of the scheme is O(dt, dx²) for the three-point stencil and O(dt, dx⁴) for the five-point stencil. Because you keep dt = O(dx²), the combined error is O(dx²) in both cases. You could try scaling dt = O(dx⁴) in the second case; however, the balance of truncation and floating-point error for Euler or any first-order method is reached around L*dt = 1e-8, where L is a Lipschitz constant of the right-hand side, so it is higher for more complex right-hand sides. Even in the best case, going beyond dx = 0.01 would be futile. Using a higher-order method in the time direction should help.
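To see the spatial orders in isolation from the time stepping, a small sketch can apply both stencils to sin(pi*x), whose second derivative is known exactly, and estimate the observed order from two grid sizes. The function names and grid sizes are made up for illustration; this is not the asker's solver and it leaves the O(dt) Euler error out entirely.

```python
import numpy as np

def d2_three_point(u, dx):
    """Second-order centered approximation of u'' at interior points 1..n-2."""
    return (u[2:] - 2.0 * u[1:-1] + u[:-2]) / dx**2

def d2_five_point(u, dx):
    """Fourth-order centered approximation of u'' at interior points 2..n-3."""
    return (-u[4:] + 16.0 * u[3:-1] - 30.0 * u[2:-2]
            + 16.0 * u[1:-3] - u[:-4]) / (12.0 * dx**2)

def max_error(stencil, n, pad):
    """Max pointwise error of the stencil for u = sin(pi x) on n grid points."""
    x = np.linspace(0.0, 1.0, n)
    dx = x[1] - x[0]
    u = np.sin(np.pi * x)
    exact = -np.pi**2 * u
    return dx, np.max(np.abs(stencil(u, dx) - exact[pad:-pad]))

def observed_order(stencil, pad):
    """Slope of log(error) against log(dx) between two grids."""
    dx_coarse, err_coarse = max_error(stencil, 21, pad)
    dx_fine, err_fine = max_error(stencil, 41, pad)
    return np.log(err_coarse / err_fine) / np.log(dx_coarse / dx_fine)

order_3c = observed_order(d2_three_point, pad=1)
order_5c = observed_order(d2_five_point, pad=2)
print("observed order, 3-point stencil:", order_3c)
print("observed order, 5-point stencil:", order_5c)
```

When the same stencils are combined with forward Euler and dt proportional to dx², the first-order time error contributes O(dx²) as well, which is the point made above.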
You used the wrong error metric. If you compare the fields on a point-by-point basis you'll get the convergence rate you were after.

Fit a curve for data made up of two distinct regimes

I'm looking for a way to plot a curve through some experimental data. The data shows a small linear regime with a shallow gradient, followed by a steep linear regime after a threshold value.
My data is here: http://pastebin.com/H4NSbxqr
I could fit the data with two lines relatively easily, but I'd like to fit with a continuous line ideally - which should look like two lines with a smooth curve joining them around the threshold (~5000 in the data, shown above).
I attempted this using scipy.optimize curve_fit and trying a function which included the sum of a straight line and an exponential:
y = a*x + b + c*np.exp((x-d)/e)
although despite numerous attempts, it didn't find a solution.
If anyone has any suggestions please, either on the choice of fitting distribution / method or the curve_fit implementation, they would be greatly appreciated.
If you don't have a particular reason to believe that linear + exponential is the true underlying cause of your data, then I think a fit to two lines makes the most sense. You can do this by making your fitting function the maximum of two lines, for example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def two_lines(x, a, b, c, d):
    one = a*x + b
    two = c*x + d
    return np.maximum(one, two)
Then,
x, y = np.genfromtxt('tmp.txt', unpack=True, delimiter=',')
pw0 = (.02, 30, .2, -2000) # a guess for slope, intercept, slope, intercept
pw, cov = curve_fit(two_lines, x, y, pw0)
crossover = (pw[3] - pw[1]) / (pw[0] - pw[2])
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-')
If you really want a continuous and differentiable solution, it occurred to me that a hyperbola has a sharp bend to it, but it has to be rotated. It was a bit difficult to implement (maybe there's an easier way), but here's a go:
def hyperbola(x, a, b, c, d, e):
    """ hyperbola(x) with parameters
        a/b = asymptotic slope
        c = curvature at vertex
        d = offset to vertex
        e = vertical offset
    """
    return a*np.sqrt((b*c)**2 + (x-d)**2)/b + e
def rot_hyperbola(x, a, b, c, d, e, th):
    pars = a, b, c, 0, 0 # do the shifting after rotation
    xd = x - d
    hsin = hyperbola(xd, *pars)*np.sin(th)
    xcos = xd*np.cos(th)
    return e + hyperbola(xcos - hsin, *pars)*np.cos(th) + xcos - hsin
Run it as
h0 = 1.1, 1, 0, 5000, 100, .5
h, hcov = curve_fit(rot_hyperbola, x, y, h0)
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-', x, rot_hyperbola(x, *h), '-')
plt.legend(['data', 'piecewise linear', 'rotated hyperbola'], loc='upper left')
plt.show()
I was also able to get the line + exponential to converge, but it looks terrible. This is because it's not a good descriptor of your data, which is linear and an exponential is very far from linear!
def line_exp(x, a, b, c, d, e):
    return a*x + b + c*np.exp((x-d)/e)
e0 = .1, 20., .01, 1000., 2000.
e, ecov = curve_fit(line_exp, x, y, e0)
If you want to keep it simple, there's always a polynomial or spline (piecewise polynomials)
from scipy.interpolate import UnivariateSpline
s = UnivariateSpline(x, y, s=x.size) #larger s-value has fewer "knots"
plt.plot(x, s(x))
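I researched this a little; Applied Linear Regression by Sanford Weisberg and the Correlation and Regression lecture by Steiger had some good info on it. However, they all lack the right model. The piecewise function should be y = th0 + th1*x + th2*max(0, x - gamma), where gamma is the breakpoint, as in the fitting code here: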
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lmfit
dfseg = pd.read_csv('segreg.csv')
def err(w):
    th0 = w['th0'].value
    th1 = w['th1'].value
    th2 = w['th2'].value
    gamma = w['gamma'].value
    fit = th0 + th1*dfseg.Temp + th2*np.maximum(0,dfseg.Temp-gamma)
    return fit-dfseg.C
p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0),('th2', 0.0),('gamma', 40.))
mi = lmfit.minimize(err, p)
lmfit.printfuncs.report_fit(mi.params)
# pull out plain floats rather than lmfit Parameter objects
b0 = mi.params['th0'].value
b1 = mi.params['th1'].value
b2 = mi.params['th2'].value
gamma = int(mi.params['gamma'].value)
import statsmodels.formula.api as smf
reslin = smf.ols('C ~ 1 + Temp + I((Temp-%d)*(Temp>%d))' % (gamma,gamma), data=dfseg).fit()
print(reslin.summary())
x0 = np.array(range(0,gamma,1))
x1 = np.array(range(0,80-gamma,1))
y0 = b0 + b1*x0
y1 = (b0 + b1 * float(gamma) + (b1 + b2)* x1)
plt.scatter(dfseg.Temp, dfseg.C)
# plt.hold was removed in Matplotlib 3.0; later plot calls draw on the same axes by default
plt.plot(x0,y0)
plt.plot(x1+gamma,y1)
plt.show()
Result
[[Variables]]
th0: 78.6554456 +/- 3.966238 (5.04%) (init= 0)
th1: -0.15728297 +/- 0.148250 (94.26%) (init= 0)
th2: 0.72471237 +/- 0.179052 (24.71%) (init= 0)
gamma: 38.3110177 +/- 4.845767 (12.65%) (init= 40)
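For a fixed breakpoint gamma the model th0 + th1*x + th2*max(0, x - gamma) is linear in th0, th1 and th2, so one way to cross-check a fit like the one above is to scan candidate breakpoints and solve an ordinary least-squares problem at each. The sketch below is an alternative to the lmfit approach, not a reproduction of it; the helper name fit_hinge, the synthetic data, and the candidate grid are all assumptions for illustration.

```python
import numpy as np

def fit_hinge(x, y, gamma):
    """Least-squares th0, th1, th2 for a fixed breakpoint gamma (hypothetical helper)."""
    X = np.column_stack([np.ones_like(x), x, np.maximum(0.0, x - gamma)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((X @ coef - y) ** 2))
    return coef, sse

# Synthetic data with a known breakpoint at 40 (assumed, not segreg.csv)
rng = np.random.default_rng(1)
x = np.linspace(5.0, 80.0, 76)
y = 78.0 - 0.15 * x + 0.7 * np.maximum(0.0, x - 40.0) + rng.normal(0.0, 1.0, x.size)

# Scan candidate breakpoints and keep the one with the smallest squared error
candidates = np.linspace(10.0, 75.0, 131)
fits = [fit_hinge(x, y, g) for g in candidates]
best = int(np.argmin([sse for _, sse in fits]))
gamma_hat = candidates[best]
th0, th1, th2 = fits[best][0]
print(gamma_hat, th0, th1, th2)
```

The grid scan avoids needing a starting guess for gamma, at the cost of limiting the breakpoint to the grid resolution.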
The data
"","Temp","C"
"1",8.5536,86.2143
"2",10.6613,72.3871
"3",12.4516,74.0968
"4",16.9032,68.2258
"5",20.5161,72.3548
"6",21.1613,76.4839
"7",24.3929,83.6429
"8",26.4839,74.1935
"9",26.5645,71.2581
"10",27.9828,78.2069
"11",32.6833,79.0667
"12",33.0806,71.0968
"13",33.7097,76.6452
"14",34.2903,74.4516
"15",36,56.9677
"16",37.4167,79.8333
"17",43.9516,79.7097
"18",45.2667,76.9667
"19",47,76
"20",47.1129,78.0323
"21",47.3833,79.8333
"22",48.0968,73.9032
"23",49.05,78.1667
"24",57.5,81.7097
"25",59.2,80.3
"26",61.3226,75
"27",61.9194,87.0323
"28",62.3833,89.8
"29",64.3667,96.4
"30",65.371,88.9677
"31",68.35,91.3333
"32",70.7581,91.8387
"33",71.129,90.9355
"34",72.2419,93.4516
"35",72.85,97.8333
"36",73.9194,92.4839
"37",74.4167,96.1333
"38",76.3871,89.8387
"39",78.0484,89.4516
Graph
I used @user423805's answer (found via a Google Groups thread: https://groups.google.com/forum/#!topic/lmfit-py/7I2zv2WwFLU ) but noticed it had some limitations when trying to use three or more segments.
Instead of applying np.maximum in the minimizer error function or adding (b1 + b2) as in @user423805's answer, I used the same linear spline calculation for both the minimizer and end-usage:
# least_splines_calc works like this for an example with three segments
# (four threshold params, two gamma params):
#
# for 0 < x < gamma1 : y = th0 + (th1 * x)
# for gamma1 < x < gamma2 : y = th0 + (th1 * x) + (th2 * (x - gamma1))
# for gamma2 < x : y = th0 + (th1 * x) + (th2 * (x - gamma1)) + (th3 * (x - gamma2))
#
def least_splines_calc(x, thresholds, gammas):
    if len(thresholds) < 2:
        print("Error: expected at least two thresholds")
        return None
    # list comprehension rather than filter(): in Python 3 filter() returns an iterator with no len()
    applicable_gammas = [gamma for gamma in gammas if x > gamma]
    # base result
    y = thresholds[0] + (thresholds[1] * x)
    # additional terms, one per breakpoint that lies below x
    for i in range(len(applicable_gammas)):
        y = y + (thresholds[i + 2] * (x - applicable_gammas[i]))
    return y
def least_splines_calc_array(x_array, thresholds, gammas):
    # return a list (not a lazy map object) so it can be passed to np.array
    y_array = [least_splines_calc(x, thresholds, gammas) for x in x_array]
    return y_array
def err(params, x, data):
    th0 = params['th0'].value
    th1 = params['th1'].value
    th2 = params['th2'].value
    th3 = params['th3'].value
    gamma1 = params['gamma1'].value
    gamma2 = params['gamma2'].value
    thresholds = np.array([th0, th1, th2, th3])
    gammas = np.array([gamma1, gamma2])
    fit = least_splines_calc_array(x, thresholds, gammas)
    return np.array(fit)-np.array(data)
p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0),('th2', 0.0),('th3', 0.0),('gamma1', 9.),('gamma2', 9.3)) #NOTE: the 9. / 9.3 were guesses specific to my data, you will need to change these
mi = lmfit.minimize(err, p, args=(np.array(dfseg.Temp), np.array(dfseg.C)))
After minimization, convert the params found by the minimizer into an array of thresholds and an array of gammas, then re-use least_splines_calc to plot the linear splines regression.
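A minimal sketch of that conversion step follows. It uses a vectorized variant of the spline calculation (the name linear_spline is hypothetical) and a plain dict of made-up values standing in for the minimizer output; with lmfit the numbers would come from mi.params[name].value.

```python
import numpy as np

def linear_spline(x, thresholds, gammas):
    """th0 + th1*x + sum over k of th[k+2] * (x - gamma_k) wherever x > gamma_k."""
    x = np.asarray(x, dtype=float)
    y = thresholds[0] + thresholds[1] * x
    # each slope change is tied to its own breakpoint by position
    for th, gamma in zip(thresholds[2:], gammas):
        y = y + th * np.where(x > gamma, x - gamma, 0.0)
    return y

# Hypothetical fitted values (assumed for illustration only)
fitted = {'th0': 1.0, 'th1': 2.0, 'th2': -1.0, 'th3': 3.0,
          'gamma1': 9.0, 'gamma2': 9.3}

thresholds = np.array([fitted['th%d' % i] for i in range(4)])
gammas = np.array([fitted['gamma1'], fitted['gamma2']])

xs = np.array([0.0, 9.0, 10.0])
ys = linear_spline(xs, thresholds, gammas)
print(ys)
```

One design difference worth noting: this variant pairs th2 with gamma1 and th3 with gamma2 regardless of their order, whereas the filter-based version pairs slopes with whichever breakpoints lie below x, which only matches when the gammas are sorted in increasing order.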
Reference: There are various places that explain least splines. I think @user423805 used http://www.statpower.net/Content/313/Lecture%20Notes/Splines.pdf , whose sample code has the (b1 + b2) addition I disagree with, despite using similar equations. The one that made the most sense to me was this one (by Rob Schapire / Zia Khan at Princeton): https://www.cs.princeton.edu/courses/archive/spring07/cos424/scribe_notes/0403.pdf - section 2.2 goes into linear splines.
If you're looking to join what appears to be two straight lines with a hyperbola having a variable radius at/near the intersection of the two lines (which are its asymptotes), I urge you to look hard at Using an Hyperbola as a Transition Model to Fit Two-Regime Straight-Line Data, by Donald G. Watts and David W. Bacon, Technometrics, Vol. 16, No. 3 (Aug., 1974), pp. 369-373.
The formula is drop dead simple, nicely adjustable, and works like a charm. From their paper (in case you can't access it):
As a more useful alternative form we consider an hyperbola for which:
(i) the dependent variable y is a single valued function of the independent variable x,
(ii) the left asymptote has slope theta_1,
(iii) the right asymptote has slope theta_2,
(iv) the asymptotes intersect at the point (x_o, beta_o),
(v) the radius of curvature at x = x_o is proportional to a quantity delta.
Such an hyperbola can be written
y = beta_o + beta_1*(x - x_o) + beta_2*SQRT[(x - x_o)^2 + delta^2/4],
where beta_1 = (theta_1 + theta_2)/2 and beta_2 = (theta_2 - theta_1)/2.
delta is the adjustable parameter that allows you to either closely follow the lines right to the intersection point or smoothly merge from one line to the other.
Just solve for the intersection point (x_o, beta_o), and plug into the formula above.
BTW, in general, if line 1 is y_1 = b_1 + m_1 *x and line 2 is y_2 = b_2 + m_2 * x, then they intersect at x* = (b_2 - b_1) / (m_1 - m_2) and y* = b_1 + m_1 * x*. So, to connect with the formalism above, x_o = x*, beta_o = y* and the two m_*'s are the two thetas.
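A small sketch of the formula above, assuming made-up values for the two lines (the function name watts_bacon and all numbers are illustrative, not from the paper). With delta = 0 the curve reduces to the two straight lines; a larger delta rounds off the corner.

```python
import numpy as np

def watts_bacon(x, x0, beta0, theta1, theta2, delta):
    """Hyperbola with asymptote slopes theta1 (left) and theta2 (right) meeting at (x0, beta0)."""
    beta1 = 0.5 * (theta1 + theta2)
    beta2 = 0.5 * (theta2 - theta1)
    return beta0 + beta1 * (x - x0) + beta2 * np.sqrt((x - x0) ** 2 + delta ** 2 / 4.0)

# Two example lines (assumed values): y = 5 + 0.1*x and y = -40 + 1.0*x
b1, m1 = 5.0, 0.1
b2, m2 = -40.0, 1.0

# Intersection of the two lines
x_star = (b2 - b1) / (m1 - m2)
y_star = b1 + m1 * x_star

x = np.array([0.0, 50.0, 100.0])
sharp = watts_bacon(x, x_star, y_star, m1, m2, 0.0)   # follows the lines exactly
smooth = watts_bacon(x, x_star, y_star, m1, m2, 4.0)  # rounded transition
print(sharp, smooth)
```

Because the function is smooth in all five parameters when delta is nonzero, it can also be passed directly to a least-squares fitter such as scipy.optimize.curve_fit, with x0, beta0, theta1, theta2 and delta as the fitted parameters.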
There is a straightforward method (not iterative, no initial guess) described on pp. 12-13 of https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf
The data come from scanning the figure published by IanRoberts in his question. Reading point coordinates off the pixels is not accurate, so don't be surprised by some additional deviation.
Note that the abscissa and ordinate scales have been divided by 1000.
The equations of the two segments are
The approximate values of the five parameters are written on the above figure.
