How to apply linear regression with a fixed x-intercept in Python?

I've found quite a few examples of fitting a linear regression with zero intercept.
However, I would like to fit a linear regression with a fixed x-intercept. In other words, the regression will start at a specific x.
I have the following code for plotting.
import numpy as np
import matplotlib.pyplot as plt
xs = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0,
20.0, 40.0, 60.0, 80.0])
ys = np.array([0.50505332505407008, 1.1207373784533172, 2.1981844719020001,
3.1746209003398689, 4.2905482471260044, 6.2816226678076958,
11.073788414382639, 23.248479770546009, 32.120462301367183,
44.036117671229206, 54.009003143831116, 102.7077685684846,
185.72880217806673, 256.12183145545811, 301.97120103079675])
def best_fit_slope_and_intercept(xs, ys):
    # m = xs.dot(ys)/xs.dot(xs)
    m = (((np.average(xs)*np.average(ys)) - np.average(xs*ys)) /
         ((np.average(xs)*np.average(xs)) - np.average(xs*xs)))
    b = np.average(ys) - m*np.average(xs)
    return m, b

def rSquaredValue(ys_orig, ys_line):
    def sqrdError(ys_orig, ys_line):
        return np.sum((ys_line - ys_orig) * (ys_line - ys_orig))
    yMeanLine = np.average(ys_orig)
    sqrtErrorRegr = sqrdError(ys_orig, ys_line)
    sqrtErrorYMean = sqrdError(ys_orig, yMeanLine)
    return 1 - (sqrtErrorRegr/sqrtErrorYMean)
m, b = best_fit_slope_and_intercept(xs, ys)
regression_line = m*xs+b
r_squared = rSquaredValue(ys, regression_line)
print(r_squared)
plt.plot(xs, ys, 'bo')
# Normal best fit
plt.plot(xs, m*xs+b, 'r-')
# Zero intercept
plt.plot(xs, m*xs, 'g-')
plt.show()
And I want something like the following, where the regression line starts at (5, 0).
Thank You. Any and all help is appreciated.

I've been thinking about this for some time and I've found a possible workaround to the problem.
If I understood well, you want to find the slope and intercept of the linear regression model with a fixed x-axis intercept.
Provided that's the case (imagine you want the x-axis intercept to take the value forced_intercept), it's as if you "moved" all the points by -forced_intercept along the x-axis, and then forced scikit-learn to use a y-axis intercept of 0. That gives you the slope. To find the intercept, just isolate b from y=ax+b and force the line through the point (forced_intercept, 0). When you do that, you get b=-a*forced_intercept (where a is the slope). In code (notice the reshaping of xs):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
xs = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0,
20.0, 40.0, 60.0, 80.0]).reshape((-1,1)) #notice you must reshape your array or you will get a ValueError from scikit-learn.
ys = np.array([0.50505332505407008, 1.1207373784533172, 2.1981844719020001,
3.1746209003398689, 4.2905482471260044, 6.2816226678076958,
11.073788414382639, 23.248479770546009, 32.120462301367183,
44.036117671229206, 54.009003143831116, 102.7077685684846,
185.72880217806673, 256.12183145545811, 301.97120103079675])
forced_intercept = 5 #as you provided in your example of (5,0)
new_xs = xs - forced_intercept #here we "move" all the points
model = LinearRegression(fit_intercept=False).fit(new_xs, ys) #force an intercept of 0
r = model.score(new_xs,ys)
a = model.coef_
b = -1 * a * forced_intercept  # here we find the intercept so that the line contains (forced_intercept, 0)
print(r,a,b)
plt.plot(xs,ys,'o')
plt.plot(xs,a*xs+b)
plt.show()
Hope this is what you were looking for.

Maybe this approach will be useful.
import numpy as np
import matplotlib.pyplot as plt
xs = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0,
20.0, 40.0, 60.0, 80.0])
ys = np.array([0.50505332505407008, 1.1207373784533172, 2.1981844719020001,
3.1746209003398689, 4.2905482471260044, 6.2816226678076958,
11.073788414382639, 23.248479770546009, 32.120462301367183,
44.036117671229206, 54.009003143831116, 102.7077685684846,
185.72880217806673, 256.12183145545811, 301.97120103079675])
# At first we add this anchor point to the points set.
xs = np.append(xs, [5.])
ys = np.append(ys, [0.])
# Then we prepare the coefficient matrix according to the docs
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html
A = np.vstack([xs, np.ones(len(xs))]).T
# Then we prepare weights for these points. And we put all weights
# equal except the last one (for added anchor point).
# In this example it's weight 1000 times larger in comparison with others.
W = np.diag(np.ones([len(xs)]))
W[-1,-1] = 1000.
# And we find least-squares solution.
m, c = np.linalg.lstsq(np.dot(W, A), np.dot(W, ys), rcond=None)[0]
plt.plot(xs, ys, 'o', label='Original data', markersize=10)
plt.plot(xs, m * xs + c, 'r', label='Fitted line')
plt.show()

If you use scikit-learn for the linear regression task, the intercept is exposed (and can even be set directly) through the model's intercept_ attribute.
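For illustration, a minimal sketch with made-up numbers showing where the intercept lives on a fitted LinearRegression (and that fit_intercept=False pins the y-intercept to zero):
import numpy as np
from sklearn.linear_model import LinearRegression

xs = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)  # made-up data
ys = np.array([2.0, 4.1, 6.2, 7.9])

model = LinearRegression().fit(xs, ys)
print(model.coef_, model.intercept_)    # fitted slope and y-intercept

model0 = LinearRegression(fit_intercept=False).fit(xs, ys)
print(model0.coef_, model0.intercept_)  # intercept_ is 0.0 when fit_intercept=False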

Related

How to add error bars to histograms with weights using matplotlib?

I have created a histogram using matplotlib of my experimental data, which consists of the measured values and their weights. Using the weights argument of plt.hist it is no problem to weight the events, but when I look at options for error bars, none seem to take event weights into account. There are solutions to this problem where Poisson errors or the same error is used everywhere, like this one, but that does not solve my problem.
The error of one bin should mathematically be calculated as err(bin) = sqrt( sum {w_i^2} ) where w_i are the individual weights of the events that belong in that bin.
A simplified example of my histogram is given below.
import matplotlib.pyplot as plt
data=[1,8,5,4,1,10,8,3,6,7]
weights=[1.3,0.2,0.01,0.9,0.4,1.05,0.6,0.6,0.8,1.8]
plt.hist(data, bins = [0.0,2.5,5.0,7.5,10.0], weights=weights)
plt.show()
You have to manually compute the errors for each bin and plot that separately.
import matplotlib.pyplot as plt  # type: ignore
import numpy as np  # type: ignore

data = np.array([1, 8, 5, 4, 1, 10, 8, 3, 6, 7])
weights = np.array([1.3, 0.2, 0.01, 0.9, 0.4, 1.05, 0.6, 0.6, 0.8, 1.8])
bin_edges = [0.0, 2.5, 5.0, 7.5, 10.0]
bin_y, _, bars = plt.hist(data, bins=bin_edges, weights=weights)
print(f"bin_y {bin_y}")
print(f"bin_edges {bin_edges}")
errors = []
bin_centers = []
for bin_index in range(len(bin_edges) - 1):
    # find which data points are inside this bin
    bin_left = bin_edges[bin_index]
    bin_right = bin_edges[bin_index + 1]
    in_bin = np.logical_and(bin_left < data, data <= bin_right)
    print(f"in_bin {in_bin}")
    # filter the weights to only those inside the bin
    weights_in_bin = weights[in_bin]
    print(f"weights_in_bin {weights_in_bin}")
    # compute the error however you want
    error = np.sqrt(np.sum(weights_in_bin ** 2))
    errors.append(error)
    print(f"error {error}")
    # save the center of the bins to plot the errorbar in the right place
    bin_center = (bin_right + bin_left) / 2
    bin_centers.append(bin_center)
    print(f"bin_center {bin_center}")
# plot the error bars
plt.errorbar(bin_centers, bin_y, yerr=errors, linestyle="none")
plt.show()
Which produces this
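As an aside, the same per-bin errors can be computed without the explicit loop by histogramming the squared weights; a sketch (repeating the data from above):
import numpy as np

data = np.array([1, 8, 5, 4, 1, 10, 8, 3, 6, 7])
weights = np.array([1.3, 0.2, 0.01, 0.9, 0.4, 1.05, 0.6, 0.6, 0.8, 1.8])
bin_edges = [0.0, 2.5, 5.0, 7.5, 10.0]

# per-bin error = sqrt(sum of squared weights in that bin),
# obtained by histogramming weights**2 over the same bin edges.
# Note: np.histogram uses half-open bins [left, right), so a point sitting
# exactly on an edge (here the value 5) may land in a different bin than
# with the loop above, which uses (left, right].
sum_w2, _ = np.histogram(data, bins=bin_edges, weights=weights ** 2)
errors_vectorized = np.sqrt(sum_w2)
print(errors_vectorized)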
By the time you added the edits I had already done the plot with the stddev of each bin; just change errors to stddevs, computed as
data_in_bin = data[in_bin]
variance = np.average((data_in_bin - bin_center) ** 2, weights=weights_in_bin)
stddev = np.sqrt(variance)
print(f"stddev {stddev}")
stddevs.append(stddev)
But you should check that the stddev computation makes sense for your use case. This results in:
Cheers!

scipy curve_fit returns initial estimates

To fit a hyperbolic function I am trying to use the following code:
import numpy as np
from scipy.optimize import curve_fit
def hyperbola(x, s_1, s_2, o_x, o_y, c):
    # x   > Input x values
    # s_1 > slope of line 1
    # s_2 > slope of line 2
    # o_x > x offset of crossing of asymptotes
    # o_y > y offset of crossing of asymptotes
    # c   > curvature of hyperbola
    b_2 = (s_1 + s_2) / 2
    b_1 = (s_2 - s_1) / 2
    return o_y + b_1 * (x - o_x) + b_2 * np.sqrt((x - o_x) ** 2 + c ** 2 / 4)
min_fit = np.array([-3.0, 0.0, -2.0, -10.0, 0.0])
max_fit = np.array([0.0, 3.0, 3.0, 0.0, 10.0])
guess = np.array([-2.5/3.0, 4/3.0, 1.0, -4.0, 0.5])
vars, covariance = curve_fit(f=hyperbola, xdata=n_step, ydata=n_mean, p0=guess, bounds=(min_fit, max_fit))
Where n_step and n_mean are measurement values generated earlier on. The code runs fine and gives no error message, but it only returns the initial guess with a very small change. Also, the covariance matrix contains only zeros. I tried to do the same fit with a better initial guess, but that does not have any influence.
Further, I plotted the exact same function with the initial guess as input and that gives me indeed a function which is close to the real values. Does anyone know where I make a mistake here? Or do I use the wrong function to make my fit?
The issue must lie with n_step and n_mean (which are not given in the question as currently stated); when trying to reproduce the issue with some arbitrarily chosen set of input parameters, the optimization works as expected. Let's try it out.
First, let's define some arbitrarily chosen input parameters in the given parameter space by
params = [-0.1, 2.95, -1, -5, 5]
Let's see what that looks like:
import matplotlib.pyplot as plt
xs = np.linspace(-30, 30, 100)
plt.plot(xs, hyperbola(xs, *params))
Based on this, let us define some rather crude inputs for xdata and ydata by
xdata = np.linspace(-30, 30, 10)
ydata = hyperbola(xdata, *params)
With these, let us run the optimization and see if we match our given parameters:
vars, covariance = curve_fit(f=hyperbola, xdata=xdata, ydata=ydata, p0=guess, bounds=(min_fit, max_fit))
print(vars) # [-0.1 2.95 -1. -5. 5. ]
That is, the fit is perfect even though our params are rather different from our guess. In other words, if we are free to choose n_step and n_mean, then the method works as expected.
In order to try to challenge the optimization slightly, we could also try to add a bit of noise:
np.random.seed(42)
xdata = np.linspace(-30, 30, 10)
ydata = hyperbola(xdata, *params) + np.random.normal(0, 10, size=len(xdata))
vars, covariance = curve_fit(f=hyperbola, xdata=xdata, ydata=ydata, p0=guess, bounds=(min_fit, max_fit))
print(vars) # [ -1.18173287e-01 2.84522636e+00 -1.57023215e+00 -6.90851334e-12 6.14480856e-08]
plt.plot(xdata, ydata, '.')
plt.plot(xs, hyperbola(xs, *vars))
Here we note that the optimum ends up being different from both our provided params and the guess, yet it remains within the bounds provided by min_fit and max_fit and still provides a good fit.

How to make a density plot of the eigenvalues of a symbolic matrix in Python

I want to make a colour plot of the difference between the two first eigenvalues of that matrix. In order to do this, first I have defined a symbolic matrix with two parameters "x" and "y". Then I obtain the eigenvectors and eigenvalues (sorted) and compute the gap between the two first eigenvalues. Finally (and I think that here is the problem...) I make a grid of points X and Y in order to evaluate it with the function "energy_gap(x,y)", storing the result in Z and then using this to do the plot, but it doesn't work... Any idea why?
import numpy as np
import numpy
import matplotlib.pyplot as plt
from sympy.utilities.lambdify import lambdify
from sympy import symbols
x = symbols("x")
y = symbols("y")
matrix = [[x+2, x,y],[y**2,x,3],[y+4,2,1]]
simbolic_matrix = lambdify((x,y), matrix,'numpy')
def eigen_system(x, y):
    values, vectors = numpy.linalg.eig(np.array(simbolic_matrix(x, y)))
    values_short = np.sort(values)
    vectors_short = vectors[:, values.argsort()]
    return values_short, vectors_short

def energy_gap(x, y):
    values, vectors = eigen_system(x, y)
    gap = abs(values[1]) - abs(values[0])
    return gap

def plot_energy_gap():
    x = np.arange(1.1, 3.0, 0.1)
    y = np.arange(1.1, 3.0, 0.1)
    X, Y = np.meshgrid(x, y)
    Z = energy_gap(X, Y)
    im = plt.imshow(Z, cmap=plt.cm.RdBu, extent=(1.1, 3, 1.1, 3))
    plt.colorbar(im)
    plt.show()

plot_energy_gap()
OK, after some extensive testing, I'm afraid I've come to the conclusion that NumPy's eigenvalue routines can't operate on a mesh of matrices the way you're trying. The best solution I could get was creating the mesh manually:
def plot_energy_gap():
    Z = []
    for x in np.arange(1.1, 3.0, 0.1):
        Z.append([])
        for y in np.arange(1.1, 3.0, 0.1):
            Z[-1].append(energy_gap(x, y))
    im = plt.imshow(Z, cmap=plt.cm.RdBu, extent=(1.1, 3, 1.1, 3))
    plt.colorbar(im)
Maybe someone else can vectorize this. EDIT: The one-line version (forgot it):
Z = [[energy_gap(x, y) for y in np.arange(1.1, 3.0, 0.1)] for x in np.arange(1.1, 3.0, 0.1)]
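One possible way to do that vectorization (a sketch, reusing the energy_gap function from the question; note that np.vectorize still loops in Python under the hood, so it mainly tidies the code rather than speeding it up):
import numpy as np
import matplotlib.pyplot as plt

# apply energy_gap element-wise over the parameter grid
energy_gap_vec = np.vectorize(energy_gap)
xs = np.arange(1.1, 3.0, 0.1)
ys = np.arange(1.1, 3.0, 0.1)
X, Y = np.meshgrid(xs, ys)
Z = energy_gap_vec(X, Y)  # note: with 'xy' meshgrid indexing the rows of Z correspond
                          # to y values, whereas the nested-list version has rows indexed by x

im = plt.imshow(Z, cmap=plt.cm.RdBu, extent=(1.1, 3, 1.1, 3))
plt.colorbar(im)
plt.show()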

Linear regression with leastsq() and global minimum not found

In Python scipy.optimize.leastsq() is normally used for non-linear regression. However, leastsq() should in principle be expected to work with linear fitting functions also. Here appears to be a simple linear regression problem that leastsq() apparently fails to solve properly. Data is fitted with the line y=mx.
Code sample is at the bottom of the post. When plot_real_data = False, then 100 points of linearly correlated data are generated randomly. Here leastsq() can effectively find the minimum of the sum-squared error function:
Graph of correct solution
However, when plot_real_data = True, then 100 data points are taken from a real data set. Here, leastsq() cannot, for some unknown reason, find the minimum of the sum-squared error function:
Graph of incorrect solution
leastsq() consistently reports an optimal gradient parameter m=1.082, regardless of the initial guess of the gradient. However m=1.082 is not the global minimum. The proper value is closer to m=1.25:
print sum(errorfunc([1.0], x, y))
3.9511006207
print sum(errorfunc([1.08], x, y))
3.59052114948
print sum(errorfunc([1.25], x, y))
3.37109033259 (near the minimum)
print sum(errorfunc([1.4], x, y))
3.79503789072
This is puzzling behaviour. In this case, the sum squared error function is a simple quadratic and there is no risk of local minima.
I know that direct methods exist for linear regression, but any ideas on this issue with leastsq()?
Python 2.7.11 :: Anaconda 4.0.0 (64-bit)
Scipy version 0.17.0
CODE:
from __future__ import division
import matplotlib.pyplot as plt
import numpy
import random
from scipy.optimize import leastsq
def errorfunc(params, x_data, y_data):
    """
    Return error at each x point, to a straight line of gradient m
    This 1-parameter error function has a clearly defined minimum
    """
    squared_errors = []
    for i, lm in enumerate(x_data):
        predicted_um = lm * params[0]
        squared_errors.append((y_data[i] - predicted_um)**2)
    return squared_errors
plt.figure()
###################################################################
# STEP 1: make a scatter plot of the data
plot_real_data = True
###################################################################
if plot_real_data :
    # 100 points of real data
    x = [0.85772, 0.17135, 0.03401, 0.17227, 0.17595, 0.1742, 0.22454, 0.32792, 0.19036, 0.17109, 0.16936, 0.17357, 0.6841, 0.24588, 0.22913, 0.28291, 0.19845, 0.3324, 0.66254, 0.1766, 0.47927, 0.47999, 0.50301, 0.16035, 0.65964, 0.0, 0.14308, 0.11648, 0.10936, 0.1983, 0.13352, 0.12471, 0.29475, 0.25212, 0.08334, 0.07697, 0.82263, 0.28078, 0.24192, 0.25383, 0.26707, 0.26457, 0.0, 0.24843, 0.26504, 0.24486, 0.0, 0.23914, 0.76646, 0.66567, 0.62966, 0.61771, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.79157, 0.06889, 0.07669, 0.1372, 0.11681, 0.11103, 0.13577, 0.07543, 0.10636, 0.09176, 0.10941, 0.08327, 1.19903, 0.20987, 0.21103, 0.21354, 0.26011, 0.28862, 0.28441, 0.2424, 0.29196, 0.20248, 0.1887, 0.20045, 1.2041, 0.20687, 0.22448, 0.23296, 0.25434, 0.25832, 0.25722, 0.24378, 0.24035, 0.17912, 0.18058, 0.13556, 0.97535, 0.25504, 0.20418, 0.22241]
    y = [1.13085, 0.19213, 0.01827, 0.20984, 0.21898, 0.12174, 0.38204, 0.31002, 0.26701, 0.2759, 0.26018, 0.24712, 1.18352, 0.29847, 0.30622, 0.5195, 0.30406, 0.30653, 1.13126, 0.24761, 0.81852, 0.79863, 0.89171, 0.19251, 1.33257, 0.0, 0.19127, 0.13966, 0.15877, 0.19266, 0.12997, 0.13133, 0.25609, 0.43468, 0.09598, 0.08923, 1.49033, 0.27278, 0.3515, 0.38368, 0.35134, 0.37048, 0.0, 0.3566, 0.36296, 0.35054, 0.0, 0.32712, 1.23759, 1.02589, 1.02413, 0.9863, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.19224, 0.12192, 0.12815, 0.2672, 0.21856, 0.14736, 0.20143, 0.1452, 0.15965, 0.14342, 0.15828, 0.12247, 0.5728, 0.10603, 0.08939, 0.09194, 0.1145, 0.10313, 0.13377, 0.09734, 0.12124, 0.11429, 0.09536, 0.11457, 0.76803, 0.10173, 0.10005, 0.10541, 0.13734, 0.12192, 0.12619, 0.11325, 0.1092, 0.11844, 0.11373, 0.07865, 1.28568, 0.25871, 0.22843, 0.26608]
else :
    # 100 points of test data with noise added
    x_clean = numpy.linspace(0,1.2,100)
    y_clean = [ i * 1.38 for i in x_clean ]
    x = [ i + random.uniform(-1 * random.uniform(0, 0.1), random.uniform(0, 0.1)) for i in x_clean ]
    y = [ i + random.uniform(-1 * random.uniform(0, 0.5), random.uniform(0, 0.5)) for i in y_clean ]
plt.subplot(2,1,1)
plt.scatter(x,y); plt.xlabel('x'); plt.ylabel('y')
# STEP 2: vary gradient m of a y = mx fitting line
# plot sum squared error with respect to gradient m
# here you can see by eye, the optimal gradient of the fitting line
plt.subplot(2,1,2)
try_m = numpy.linspace(0.1,4,200)
sse = [ sum(errorfunc([m], x, y)) for m in try_m ]
plt.plot(try_m,sse); plt.xlabel('line gradient, m'); plt.ylabel('sum-squared error')
# STEP 3: use leastsq() to find optimal gradient m
params = [2] # start with initial guess of 2 for gradient
params_fitted, cov, infodict, mesg, ier = leastsq(errorfunc, params[:], args=(x, y), full_output=1)
optimal_m = params_fitted[0]
print optimal_m
# optimal gradient m should be the minimum of the error function
plt.subplot(2,1,2)
plt.plot([optimal_m,optimal_m],[0,100], 'r')
# optimal gradient m should give best fit straight line
plt.subplot(2,1,1)
plt.plot([0, 1.2],[0, 1.2 * optimal_m],'r')
plt.show()

Fit a non-linear function to data/observations with pyMCMC/pyMC

I am trying to fit some data with a Gaussian (and more complex) function(s). I have created a small example below.
My first question is, am I doing it right?
My second question is, how do I add an error in the x-direction, i.e. in the x-position of the observations/data?
It is very hard to find nice guides on how to do this kind of regression in pyMC, perhaps because it's easier to use some least-squares or similar approach. However, I have many parameters in the end and need to see how well we can constrain them and compare different models, so pyMC seemed like the good choice for that.
import pymc
import numpy as np
import matplotlib.pyplot as plt; plt.ion()
x = np.arange(5,400,10)*1e3
# Parameters for gaussian
amp_true = 0.2
size_true = 1.8
ps_true = 0.1
# Gaussian function
gauss = lambda x,amp,size,ps: amp*np.exp(-1*(np.pi**2/(3600.*180.)*size*x)**2/(4.*np.log(2.)))+ps
f_true = gauss(x=x,amp=amp_true, size=size_true, ps=ps_true )
# add noise to the data points
noise = np.random.normal(size=len(x)) * .02
f = f_true + noise
f_error = np.ones_like(f_true)*0.05*f.max()
# define the model/function to be fitted.
def model(x, f):
    amp = pymc.Uniform('amp', 0.05, 0.4, value=0.15)
    size = pymc.Uniform('size', 0.5, 2.5, value=1.0)
    ps = pymc.Normal('ps', 0.13, 40, value=0.15)

    @pymc.deterministic(plot=False)
    def gauss(x=x, amp=amp, size=size, ps=ps):
        e = -1*(np.pi**2*size*x/(3600.*180.))**2/(4.*np.log(2.))
        return amp*np.exp(e)+ps

    y = pymc.Normal('y', mu=gauss, tau=1.0/f_error**2, value=f, observed=True)
    return locals()
MDL = pymc.MCMC(model(x,f))
MDL.sample(1e4)
# extract and plot results
y_min = MDL.stats()['gauss']['quantiles'][2.5]
y_max = MDL.stats()['gauss']['quantiles'][97.5]
y_fit = MDL.stats()['gauss']['mean']
plt.plot(x,f_true,'b', marker='None', ls='-', lw=1, label='True')
plt.errorbar(x,f,yerr=f_error, color='r', marker='.', ls='None', label='Observed')
plt.plot(x,y_fit,'k', marker='+', ls='None', ms=5, mew=2, label='Fit')
plt.fill_between(x, y_min, y_max, color='0.5', alpha=0.5)
plt.legend()
I realize that I might have to run more iterations and use burn-in and thinning in the end. The figure plotting the data and the fit is shown below.
The pymc.Matplot.plot(MDL) figures look like this, showing nicely peaked distributions. This is good, right?
My first question is, am I doing it right?
Yes! You need to include a burn-in period, which you know. I like to throw out the first half of my samples. You don't need to do any thinning, but sometimes it will make your post-MCMC work faster to process and smaller to store.
The only other thing I advise is to set a random seed, so that your results are "reproducible": np.random.seed(12345) will do the trick.
Oh, and if I was really giving too much advice, I'd say import seaborn to make the matplotlib results a little more beautiful.
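For concreteness, a hedged sketch of what the burn-in and thinning might look like with the MDL object from the question (the exact counts here are placeholders, not recommendations):
np.random.seed(12345)  # make the run reproducible, as suggested above
MDL = pymc.MCMC(model(x, f))
MDL.sample(iter=20000, burn=10000, thin=2)  # throw away the first half, keep every 2nd sample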
My second question is, how do I add an error in the x-direction, i.e. in the x-position of the observations/data?
One way is to include a latent variable for each error. This works in your example, but will not be feasible if you have many more observations. I'll give a little example to get you started down this road:
# add noise to observed x values
x_obs = pm.rnormal(mu=x, tau=(1e4)**-2)

# define the model/function to be fitted.
def model(x_obs, f):
    amp = pm.Uniform('amp', 0.05, 0.4, value=0.15)
    size = pm.Uniform('size', 0.5, 2.5, value=1.0)
    ps = pm.Normal('ps', 0.13, 40, value=0.15)
    x_pred = pm.Normal('x', mu=x_obs, tau=(1e4)**-2)  # this allows error in x_obs

    @pm.deterministic(plot=False)
    def gauss(x=x_pred, amp=amp, size=size, ps=ps):
        e = -1*(np.pi**2*size*x/(3600.*180.))**2/(4.*np.log(2.))
        return amp*np.exp(e)+ps

    y = pm.Normal('y', mu=gauss, tau=1.0/f_error**2, value=f, observed=True)
    return locals()

MDL = pm.MCMC(model(x_obs, f))
MDL.use_step_method(pm.AdaptiveMetropolis, MDL.x_pred)  # use AdaptiveMetropolis to "learn" how to step
MDL.sample(200000, 100000, 10)  # run chain longer since there are more dimensions
It looks like it may be hard to get good answers if you have noise in x and y:
Here is a notebook collecting this all up.
EDIT: Important note
This has been bothering me for a while now. The answers given by myself and Abraham here are correct in the sense that they add variability to x. HOWEVER: Note that you cannot simply add uncertainty in this way to cancel out the errors you have in your x-values, so that you regress against "true x". The methods in this answer can show you how adding errors to x affects your regression if you have the true x. If you have a mismeasured x, these answers will not help you. Having errors in the x-values is a very tricky problem to solve, as it leads to "attenuation" and an "errors-in-variables effect". The short version is: having unbiased, random errors in x leads to bias in your regression estimates. If you have this problem, check out Carroll, R.J., Ruppert, D., Crainiceanu, C.M. and Stefanski, L.A., 2006. Measurement error in nonlinear models: a modern perspective. Chapman and Hall/CRC., or for a Bayesian approach, Gustafson, P., 2003. Measurement error and misclassification in statistics and epidemiology: impacts and Bayesian adjustments. CRC Press. I ended up solving my specific problem using Carroll et al.'s SIMEX method along with PyMC3. The details are in Carstens, H., Xia, X. and Yadavalli, S., 2017. Low-cost energy meter calibration method for measurement and verification. Applied energy, 188, pp.563-575. It is also available on ArXiv
I converted Abraham Flaxman's answer above into PyMC3, in case someone needs it. There are some very minor changes, but they can be confusing nevertheless.
The first is that the deterministic decorator @deterministic is replaced by a distribution-like call, var = pymc3.Deterministic(). Second, when generating a vector of normally distributed random variables,
rvs = pymc2.rnormal(mu=mu, tau=tau)
is replaced by
rvs = pymc3.Normal('var_name', mu=mu, tau=tau,shape=size(var)).random()
The complete code is as follows:
import numpy as np
from pymc3 import *
import matplotlib.pyplot as plt
# set random seed for reproducibility
np.random.seed(12345)
x = np.arange(5,400,10)*1e3
# Parameters for gaussian
amp_true = 0.2
size_true = 1.8
ps_true = 0.1
#Gaussian function
gauss = lambda x,amp,size,ps: amp*np.exp(-1*(np.pi**2/(3600.*180.)*size*x)**2/(4.*np.log(2.)))+ps
f_true = gauss(x=x,amp=amp_true, size=size_true, ps=ps_true )
# add noise to the data points
noise = np.random.normal(size=len(x)) * .02
f = f_true + noise
f_error = np.ones_like(f_true)*0.05*f.max()
with Model() as model3:
    amp = Uniform('amp', 0.05, 0.4, testval=0.15)
    size = Uniform('size', 0.5, 2.5, testval=1.0)
    ps = Normal('ps', 0.13, 40, testval=0.15)
    gauss = Deterministic('gauss', amp*np.exp(-1*(np.pi**2*size*x/(3600.*180.))**2/(4.*np.log(2.)))+ps)
    y = Normal('y', mu=gauss, tau=1.0/f_error**2, observed=f)
    start = find_MAP()
    step = NUTS()
    trace = sample(2000, start=start)
# extract and plot results
y_min = np.percentile(trace.gauss,2.5,axis=0)
y_max = np.percentile(trace.gauss,97.5,axis=0)
y_fit = np.percentile(trace.gauss,50,axis=0)
plt.plot(x,f_true,'b', marker='None', ls='-', lw=1, label='True')
plt.errorbar(x,f,yerr=f_error, color='r', marker='.', ls='None', label='Observed')
plt.plot(x,y_fit,'k', marker='+', ls='None', ms=5, mew=1, label='Fit')
plt.fill_between(x, y_min, y_max, color='0.5', alpha=0.5)
plt.legend()
Which results in
y_error
For errors in x (note the 'x' suffix to variables):
# define the model/function to be fitted in PyMC3:
with Model() as modelx:
    x_obsx = pm3.Normal('x_obsx', mu=x, tau=(1e4)**-2, shape=40)
    ampx = Uniform('ampx', 0.05, 0.4, testval=0.15)
    sizex = Uniform('sizex', 0.5, 2.5, testval=1.0)
    psx = Normal('psx', 0.13, 40, testval=0.15)
    x_pred = Normal('x_pred', mu=x_obsx, tau=(1e4)**-2*np.ones_like(x_obsx), testval=5*np.ones_like(x_obsx), shape=40)  # this allows error in x_obs
    gauss = Deterministic('gauss', ampx*np.exp(-1*(np.pi**2*sizex*x_pred/(3600.*180.))**2/(4.*np.log(2.)))+psx)
    y = Normal('y', mu=gauss, tau=1.0/f_error**2, observed=f)
    start = find_MAP()
    step = NUTS()
    tracex = sample(20000, start=start)
Which results in:
x_error_graph
The last observation is that when doing
traceplot(tracex[100:])
plt.tight_layout();
(result not shown), we can see that sizex seems to be suffering from 'attenuation' or 'regression dilution' due to the error in the measurement of x.
