I am currently trying to learn Bayesian Statistics and how to implement it with pymc for a project at work.
So far I'm just playing around with this linear regression tutorial.
Here's the code snippet to generate y_observed:
size = 200
true_intercept = 1
true_slope = 2
x = np.linspace(0,1,size)
# y = a + b*x
true_y = true_intercept + true_slope * x
y_observed = true_y + np.random.normal(scale=0.5, size=size)
And here's the plot of y_observed and true_y:
Then, following the tutorial I fit the model as
with pm.Model() as model:
    sigma = pm.HalfCauchy("sigma", beta=10)
    intercept = pm.Normal("intercept", 0, sigma=20)
    slope = pm.Normal("slope", 0, sigma=20)
    likelihood = pm.Normal("y", mu=intercept + slope*x, sigma=sigma, observed=y_observed)
    idata = pm.sample(3000)
My question then is... how should I represent the posterior that I got (idata.posterior) in my graph?
Considering that the posterior was sampled from 4 different chains, I thought of calculating the means of my intercept, slope and sigma, and plotting the resulting line plus and minus the std of sigma... but that gave me a very bad line, which makes me believe that this approach is wrong.
Below is the code snippet for what I just described, followed by the graph that I got:
def plot(x, true_y, true_line, idata):
    intercept_mean = float(idata.posterior.intercept.mean())  # get the mean of the intercept (from all 4 chains)
    slope_mean = float(idata.posterior.slope.mean())  # get the mean of the slope (from all 4 chains)
    sigma_mean = float(idata.posterior.sigma.mean())  # the mean of the error (from all 4 chains)
    sigma_std = float(idata.posterior.sigma.std())  # the std of the error (from all 4 chains)
    y_pred_no_std = intercept_mean + slope_mean*x + sigma_mean
    y_pred_plus = y_pred_no_std + sigma_std
    y_pred_minus = y_pred_no_std - sigma_std
    fig = plt.figure(figsize=(7, 7))
    ax = fig.add_subplot(111, xlabel="x", ylabel="y", title="Generated data and underlying model")
    ax.plot(x, true_line, "x", label="y_observed")
    ax.plot(x, true_y, label="true regression line", lw=2.0)
    ax.plot(x, y_pred_plus, label="pymc regression mean + std")
    ax.plot(x, y_pred_no_std, label='pymc regression mean')
    ax.plot(x, y_pred_minus, label="pymc regression mean - std")
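One common way to show a posterior on such a plot is to evaluate the regression line for every posterior draw and shade a credible band around the mean line, instead of shifting the mean line by sigma. Below is a minimal sketch of that idea, assuming the draws have been flattened across chains into 1-D arrays. The helper name `regression_band` and the synthetic draws are stand-ins for illustration only; they are not output from the model above.

```python
import numpy as np

def regression_band(x, intercept_draws, slope_draws, level=0.94):
    """Posterior mean line and central credible band for intercept + slope*x."""
    # One regression line per posterior draw: shape (n_draws, len(x))
    lines = intercept_draws[:, None] + slope_draws[:, None] * x[None, :]
    tail = (1.0 - level) / 2.0
    # Pointwise quantiles across draws give the band at each x
    lower, upper = np.quantile(lines, [tail, 1.0 - tail], axis=0)
    return lines.mean(axis=0), lower, upper

# Stand-in draws (hypothetical values). With a real trace you would use e.g.
# idata.posterior["intercept"].values.ravel() and likewise for "slope".
rng = np.random.default_rng(0)
intercept_draws = rng.normal(1.0, 0.07, size=4000)
slope_draws = rng.normal(2.0, 0.12, size=4000)

x = np.linspace(0, 1, 200)
mean_line, lower, upper = regression_band(x, intercept_draws, slope_draws)
```

The band can then be drawn with `ax.plot(x, mean_line)` and `ax.fill_between(x, lower, upper, alpha=0.3)`. This band describes uncertainty in the line itself; to show where new observations would fall, one would add normal noise scaled by the sigma draws to each line before taking the quantiles.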
I'd like to fit a Gaussian to some data that has a roughly Gaussian shape. I'd like to obtain the data peak (A), center position (mu), and standard deviation (sigma), along with 95% confidence intervals for these values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.stats import norm
# gaussian function
def gaussian_func(x, A, mu, sigma):
    return A * np.exp( - (x - mu)**2 / (2 * sigma**2))
# generate toy data
x = np.arange(50)
y = [ 97.04421053, 96.53052632, 96.85684211, 96.33894737, 96.85052632,
96.30526316, 96.87789474, 96.75157895, 97.05052632, 96.73473684,
96.46736842, 96.23368421, 96.22526316, 96.11789474, 96.41263158,
96.32631579, 96.33684211, 96.44421053, 96.48421053, 96.49894737,
97.30105263, 98.58315789, 100.07368421, 101.43578947, 101.92210526,
102.26736842, 101.80421053, 101.91157895, 102.07368421, 102.02105263,
101.35578947, 99.83578947, 98.28, 96.98315789, 96.61473684,
96.82947368, 97.09263158, 96.82105263, 96.24210526, 95.95578947,
95.84210526, 95.67157895, 95.83157895, 95.37894737, 95.25473684,
95.32842105, 95.45684211, 95.31578947, 95.42526316, 95.30526316]
# initial guess of parameters
# These values should be found with a solver or similar.
parameter_initial = np.array([652, 2.9, 1.3])
# estimate optimal parameter & parameter covariance
popt, pcov = curve_fit(gaussian_func, x, y, p0=parameter_initial)
# plot result
xd = np.arange(x.min(), x.max(), 0.01)
estimated_curve = gaussian_func(xd, popt[0], popt[1], popt[2])
plt.plot(xd, estimated_curve, label="Estimated curve", color="r")
# estimate standard Error
StdE = np.sqrt(np.diag(pcov))
# estimate 95% confidence interval
alpha = 0.025  # tail probability on each side of a two-sided 95% interval
lwCI = popt + norm.ppf(q=alpha)*StdE
upCI = popt + norm.ppf(q=1-alpha)*StdE
# print result
mat = np.vstack((popt, StdE, lwCI, upCI)).T
df = pd.DataFrame(mat, index=("A", "mu", "sigma"),
                  columns=("Estimate", "Std. Error", "lwCI", "upCI"))
Data Plot with Fitted Curve
The data peak and center position seem correct, but the standard deviation is off. Any input is greatly appreciated.
Your scatter indeed looks similar to a Gaussian distribution, but it is not centered around zero. Given the specifics of the Gaussian function, it will therefore be hard to fit a Gaussian nicely to the data in the form you gave us. I would therefore propose starting by demeaning the x series:
x = np.arange(0, 50) - 24.5
Next I would add one additional parameter to your gaussian function, the offset. Since the regular Gaussian function will always have its tails close to zero it is impossible to otherwise nicely fit your scatterplot:
def gaussian_function(x, A, mu, sigma, offset):
    return A * np.exp(-np.power((x - mu)/sigma, 2.)/2.) + offset
Next you should define an error_loss_function to minimise:
def error_loss_function(params):
    gaussian = gaussian_function(x, params[0], params[1], params[2], params[3])
    errors = gaussian - y
    return sum(np.power(errors, 2))  # You can also pick a different error loss function!
All that remains is fitting our curve now:
fit = scipy.optimize.minimize(fun=error_loss_function, x0=[2, 0, 0.2, 97])
params = fit.x # A: 6.57592661, mu: 1.95248855, sigma: 3.93230503, offset: 96.12570778
xd = np.arange(x.min(), x.max(), 0.01)
estimated_curve = gaussian_function(xd, params[0], params[1], params[2], params[3])
plt.plot(xd, estimated_curve, label="Estimated curve", color="b")
Hopefully this helps. Looks like a fun project, let me know if my answer is not clear.
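Since the question also asked for 95% confidence intervals, here is a rough sketch that combines the offset parameter from this answer with the `curve_fit`-based intervals from the question. The function name `gaussian_offset` and the way the starting values are derived from the data are my own assumptions, and the intervals rely on the usual normal approximation from the covariance matrix, so treat them as approximate.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def gaussian_offset(x, A, mu, sigma, offset):
    """Gaussian peak of height A sitting on a constant offset."""
    return A * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) + offset

x = np.arange(50, dtype=float)
y = np.array([
    97.04421053, 96.53052632, 96.85684211, 96.33894737, 96.85052632,
    96.30526316, 96.87789474, 96.75157895, 97.05052632, 96.73473684,
    96.46736842, 96.23368421, 96.22526316, 96.11789474, 96.41263158,
    96.32631579, 96.33684211, 96.44421053, 96.48421053, 96.49894737,
    97.30105263, 98.58315789, 100.07368421, 101.43578947, 101.92210526,
    102.26736842, 101.80421053, 101.91157895, 102.07368421, 102.02105263,
    101.35578947, 99.83578947, 98.28, 96.98315789, 96.61473684,
    96.82947368, 97.09263158, 96.82105263, 96.24210526, 95.95578947,
    95.84210526, 95.67157895, 95.83157895, 95.37894737, 95.25473684,
    95.32842105, 95.45684211, 95.31578947, 95.42526316, 95.30526316])

# Starting values read off the data: peak height above the floor,
# location of the maximum, a guessed width, and the lowest value as offset
p0 = [y.max() - y.min(), x[np.argmax(y)], 3.0, y.min()]
popt, pcov = curve_fit(gaussian_offset, x, y, p0=p0)

# Standard errors and normal-approximation 95% intervals
std_err = np.sqrt(np.diag(pcov))
z = norm.ppf(0.975)
lower = popt - z * std_err
upper = popt + z * std_err

for name, est, lo, hi in zip(("A", "mu", "sigma", "offset"), popt, lower, upper):
    print(f"{name}: {est:.3f}  [{lo:.3f}, {hi:.3f}]")
```

Here x is left un-demeaned, so mu is reported on the original 0 to 49 axis.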
How can I find and plot a LOWESS curve that looks like the following using Python?
I'm aware of the LOWESS implementation in statsmodels, but it doesn't seem to be able to give me 95% confidence interval lines that I can shade between. Seaborn has a method that calls the statsmodels implementation, but it can't plot the confidence intervals.
Other StackOverflow answers give code to draw a LOESS/LOWESS line, but none with a confidence interval. Can anyone assist with this? Is anyone aware of an existing implementation that would enable me to do this?
Thanks in advance.
I found this link useful, and I put the code below:
def lowess(x, y, f=1./3.):
    """
    Basic LOWESS smoother with uncertainty.
    Note:
        - Not robust (so no iteration) and
          only normally distributed errors.
        - No higher order polynomials d=1
          so linear smoother.
    """
    # get some paras
    xwidth = f*(x.max()-x.min())  # effective width after reduction factor
    N = len(x)  # number of obs
    # Don't assume the data is sorted
    order = np.argsort(x)
    # storage
    y_sm = np.zeros_like(y)
    y_stderr = np.zeros_like(y)
    # define the weighting function -- clipping too!
    tricube = lambda d : np.clip((1- np.abs(d)**3)**3, 0, 1)
    # run the regression for each observation i
    for i in range(N):
        dist = np.abs((x[order][i]-x[order]))/xwidth
        w = tricube(dist)
        # form linear system with the weights
        A = np.stack([w, x[order]*w]).T
        b = w * y[order]
        ATA = A.T.dot(A)
        ATb = A.T.dot(b)
        # solve the system
        sol = np.linalg.solve(ATA, ATb)
        # predict for the observation only
        yest = A[i].dot(sol)  # equiv of A.dot(sol) just for i
        place = order[i]
        y_sm[place] = yest
        sigma2 = (np.sum((A.dot(sol) - y[order])**2)/N)
        # Calculate the standard error
        y_stderr[place] = np.sqrt(sigma2 *
                                  A[i].dot(np.linalg.inv(ATA)).dot(A[i]))
    return y_sm, y_stderr
import numpy as np
import matplotlib.pyplot as plt
# make some data
x = 5*np.random.random(100)
y = np.sin(x) * 3*np.exp(-x) + np.random.normal(0, 0.2, 100)
order = np.argsort(x)
#run it
y_sm, y_std = lowess(x, y, f=1./5.)
# plot it
plt.plot(x[order], y_sm[order], color='tomato', label='LOWESS')
plt.fill_between(x[order], y_sm[order] - 1.96*y_std[order],
y_sm[order] + 1.96*y_std[order], alpha=0.3, label='LOWESS uncertainty')
plt.plot(x, y, 'k.', label='Observations')
#run it
y_sm, y_std = lowess(x, y, f=1./5.)
# plot it
plt.plot(x[order], y_sm[order], color='tomato', label='LOWESS')
plt.fill_between(x[order], y_sm[order] - y_std[order],
y_sm[order] + y_std[order], alpha=0.3, label='LOWESS uncertainty')
plt.plot(x, y, 'k.', label='Observations')
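If you would rather not rely on the normal-error assumption behind the analytic standard error, a bootstrap band is another option. The following is a rough NumPy-only sketch of that swapped-in technique, not the method from the link: it refits a tricube-weighted local linear smoother on resampled data and takes pointwise percentiles. The function names `local_linear` and `bootstrap_band` and the choice of 200 resamples are my own assumptions.

```python
import numpy as np

def local_linear(x, y, x_eval, f=0.2):
    """Tricube-weighted local linear fit evaluated at x_eval (no robustness iterations)."""
    width = f * (x.max() - x.min())
    out = np.empty(len(x_eval))
    for k, x0 in enumerate(x_eval):
        # Tricube weights, zero outside the window around x0
        w = np.clip(1.0 - np.abs((x - x0) / width) ** 3, 0.0, 1.0) ** 3
        X = np.column_stack([np.ones_like(x), x - x0])
        sw = np.sqrt(w)
        # Weighted least squares; the intercept is the fitted value at x0
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        out[k] = beta[0]
    return out

def bootstrap_band(x, y, x_eval, f=0.2, n_boot=200, level=0.95, seed=0):
    """Pointwise percentile band from refitting on resampled (x, y) pairs."""
    rng = np.random.default_rng(seed)
    fits = np.empty((n_boot, len(x_eval)))
    for b in range(n_boot):
        idx = rng.integers(0, len(x), len(x))  # resample pairs with replacement
        fits[b] = local_linear(x[idx], y[idx], x_eval, f)
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(fits, [tail, 1.0 - tail], axis=0)
    return lower, upper

# Same kind of toy data as above
rng = np.random.default_rng(1)
x = 5 * rng.random(100)
y = np.sin(x) * 3 * np.exp(-x) + rng.normal(0, 0.2, 100)
x_eval = np.linspace(x.min(), x.max(), 50)

smooth = local_linear(x, y, x_eval)
lower, upper = bootstrap_band(x, y, x_eval)
```

The band can be shaded with `plt.fill_between(x_eval, lower, upper, alpha=0.3)`. It is slower than the analytic version because every resample refits the smoother.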
I have been trying to figure out the full width half maximum (FWHM) of the blue peak (see image). The green peak and the magenta peak combined make up the blue peak. I have been using the following equation to find the FWHM of the green and magenta peaks: fwhm = 2*np.sqrt(2*(math.log(2)))*sd where sd = standard deviation. I created the green and magenta peaks and I know the standard deviation which is why I can use that equation.
I created the green and magenta peaks using the following code:
def make_norm_dist(self, x, mean, sd):
    import numpy as np
    norm = []
    for i in range(x.size):
        norm += [1.0/(sd*np.sqrt(2*np.pi))*np.exp(-(x[i] - mean)**2/(2*sd**2))]
    return np.array(norm)
If I did not know the blue peak was made up of two peaks and I only had the blue peak in my data, how would I find the FWHM?
I have been using this code to find the peak top:
peak_top = 0.0e-1000
for i in x_axis:
    if i > peak_top:
        peak_top = i
I could divide the peak_top by 2 to find the half height and then try and find y-values corresponding to the half height, but then I would run into trouble if there are no x-values exactly matching the half height.
I am pretty sure there is a more elegant solution to the one I am trying.
You can use a spline to fit [blue curve - peak/2], and then find its roots:
import numpy as np
from scipy.interpolate import UnivariateSpline
def make_norm_dist(x, mean, sd):
    return 1.0/(sd*np.sqrt(2*np.pi))*np.exp(-(x - mean)**2/(2*sd**2))
x = np.linspace(10, 110, 1000)
green = make_norm_dist(x, 50, 10)
pink = make_norm_dist(x, 60, 10)
blue = green + pink
# create a spline of x and blue-np.max(blue)/2
spline = UnivariateSpline(x, blue-np.max(blue)/2, s=0)
r1, r2 = spline.roots() # find the roots
import pylab as pl
pl.plot(x, blue)
pl.axvspan(r1, r2, facecolor='g', alpha=0.5)
Here is the result:
This worked for me in iPython (quick and dirty, can be reduced to 3 lines):
import numpy as np
def FWHM(X, Y):
    half_max = max(Y) / 2.
    # find where the function crosses the line half_max (where the sign of the difference flips)
    # take the 'derivative' of signum(half_max - Y[])
    d = np.sign(half_max - np.array(Y[0:-1])) - np.sign(half_max - np.array(Y[1:]))
    # plot(X[0:len(d)], d)  # if you are interested
    # find the leftmost and rightmost indexes
    left_idx = np.where(d > 0)[0][0]
    right_idx = np.where(d < 0)[0][-1]
    return X[right_idx] - X[left_idx]  # return the difference (full width)
Some additions can be made to make the resolution more accurate, but in the limit that there are many samples along the X axis and the data is not too noisy, this works great.
Even when the data are not Gaussian and a little noisy, it worked for me (I just take the first and last time half max crosses the data).
If your data has noise (and it always does in the real world), a more robust solution would be to fit a Gaussian to the data and extract FWHM from that:
import numpy as np
import scipy.optimize as opt
def gauss(x, p):  # p[0]==mean, p[1]==stdev
    return 1.0/(p[1]*np.sqrt(2*np.pi))*np.exp(-(x-p[0])**2/(2*p[1]**2))
# Create some sample data
known_param = np.array([2.0, .7])
xmin,xmax = -1.0, 5.0
N = 1000
X = np.linspace(xmin,xmax,N)
Y = gauss(X, known_param)
# Add some noise
Y += .10*np.random.random(N)
# Renormalize to a proper PDF
Y /= ((xmax-xmin)/N)*Y.sum()
# Fit a Gaussian
p0 = [0, 1]  # Initial guess is a normal distribution
errfunc = lambda p, x, y: gauss(x, p) - y  # Distance to the target function
p1, success = opt.leastsq(errfunc, p0[:], args=(X, Y))
fit_mu, fit_stdev = p1
FWHM = 2*np.sqrt(2*np.log(2))*fit_stdev
print("FWHM", FWHM)
The plotted image can be generated by:
from pylab import *
plot(X, gauss(X,p1),lw=3,alpha=.5, color='r')
axvspan(fit_mu-FWHM/2, fit_mu+FWHM/2, facecolor='g', alpha=0.5)
An even better approximation would filter out the noisy data below a given threshold before the fit.
Here is a nice little function using the spline approach.
import numpy as np
from scipy.interpolate import splrep, sproot
class MultiplePeaks(Exception): pass
class NoPeaksFound(Exception): pass
def fwhm(x, y, k=3):
    """
    Determine full-width-half-maximum of a peaked set of points, x and y.
    Assumes that there is only one peak present in the dataset. The function
    uses a spline interpolation of order k (sproot needs a cubic spline, so k=3).
    """
    half_max = np.max(y)/2.0
    s = splrep(x, y - half_max, k=k)
    roots = sproot(s)
    if len(roots) > 2:
        raise MultiplePeaks("The dataset appears to have multiple peaks, and "
                            "thus the FWHM can't be determined.")
    elif len(roots) < 2:
        raise NoPeaksFound("No proper peaks were found in the data set; likely "
                           "the dataset is flat (e.g. all zeros).")
    return abs(roots[1] - roots[0])
You should use scipy to solve it: first scipy.signal.find_peaks and then scipy.signal.peak_widths.
With the default value of rel_height (0.5), you are measuring the width at half maximum of the peak.
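A short sketch of that approach, reusing the two-Gaussian test curve from the earlier answer. Two things to keep in mind: `peak_widths` returns widths in units of samples, so they have to be scaled by the grid spacing, and it measures `rel_height` relative to the peak's prominence rather than to zero, which only matches the classic FWHM when the baseline is close to zero (as it is here).

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths

def make_norm_dist(x, mean, sd):
    return 1.0 / (sd * np.sqrt(2 * np.pi)) * np.exp(-(x - mean) ** 2 / (2 * sd ** 2))

x = np.linspace(10, 110, 1000)
blue = make_norm_dist(x, 50, 10) + make_norm_dist(x, 60, 10)

# Indices of local maxima; the merged curve has a single peak
peaks, _ = find_peaks(blue)

# Widths at half prominence, plus interpolated left/right crossing positions
widths, width_heights, left_ips, right_ips = peak_widths(blue, peaks, rel_height=0.5)

# Convert from sample units to x units (uniform grid assumed)
dx = x[1] - x[0]
fwhm = widths * dx
x_left = np.interp(left_ips, np.arange(x.size), x)
x_right = np.interp(right_ips, np.arange(x.size), x)
print(fwhm, x_left, x_right)
```

For noisy data `find_peaks` will report many small local maxima, so passing something like its `prominence` argument to keep only the main peak is usually necessary.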
If you prefer interpolation over fitting:
import numpy as np
def get_full_width(x: np.ndarray, y: np.ndarray, height: float = 0.5) -> float:
    height_half_max = np.max(y) * height
    index_max = np.argmax(y)
    x_low = np.interp(height_half_max, y[:index_max+1], x[:index_max+1])
    x_high = np.interp(height_half_max, np.flip(y[index_max:]), np.flip(x[index_max:]))
    return x_high - x_low
For monotonic functions with many data points and if there's no need for perfect accuracy, I would use:
def FWHM(X, Y):
    deltax = X[1] - X[0]  # assumes uniformly spaced X
    half_max = max(Y) / 2.
    l = np.where(Y > half_max, 1, 0)
    return np.sum(l) * deltax
I implemented an empirical solution which works for noisy and not-quite-Gaussian data fairly well in haggis.math.full_width_half_max. The usage is extremely straightforward:
fwhm = full_width_half_max(x, y)
The function is robust: it simply finds the maximum of the data and the nearest points crossing the "halfway down" threshold using the requested interpolation scheme.
Here are a couple of examples using data from the other answers.
#HYRY's smooth data
def make_norm_dist(x, mean, sd):
    return 1.0/(sd*np.sqrt(2*np.pi))*np.exp(-(x - mean)**2/(2*sd**2))
x = np.linspace(10, 110, 1000)
green = make_norm_dist(x, 50, 10)
pink = make_norm_dist(x, 60, 10)
blue = green + pink
# create a spline of x and blue-np.max(blue)/2
spline = UnivariateSpline(x, blue-np.max(blue)/2, s=0)
r1, r2 = spline.roots() # find the roots
# Compute using my function
fwhm, (x1, y1), (x2, y2) = full_width_half_max(x, blue, return_points=True)
# Print comparison
print('HYRY:', r2 - r1, 'MP:', fwhm)
plt.plot(x, blue)
plt.axvspan(r1, r2, facecolor='g', alpha=0.5)
plt.plot(x1, y1, 'r.')
plt.plot(x2, y2, 'r.')
For smooth data, the results are pretty exact:
HYRY: 26.891157007233254 MP: 26.891193606203814
#Hooked's Noisy Data
def gauss(x, p):  # p[0]==mean, p[1]==stdev
    return 1.0/(p[1]*np.sqrt(2*np.pi))*np.exp(-(x-p[0])**2/(2*p[1]**2))
# Create some sample data
known_param = np.array([2.0, .7])
xmin,xmax = -1.0, 5.0
N = 1000
X = np.linspace(xmin,xmax,N)
Y = gauss(X, known_param)
# Add some noise
Y += .10*np.random.random(N)
# Renormalize to a proper PDF
Y /= ((xmax-xmin)/N)*Y.sum()
# Fit a Gaussian
p0 = [0, 1]  # Initial guess is a normal distribution
errfunc = lambda p, x, y: gauss(x, p) - y # Distance to the target function
p1, success = opt.leastsq(errfunc, p0[:], args=(X, Y))
fit_mu, fit_stdev = p1
FWHM = 2*np.sqrt(2*np.log(2))*fit_stdev
# Compute using my function
fwhm, (x1, y1), (x2, y2) = full_width_half_max(X, Y, return_points=True)
# Print comparison
print('Hooked:', FWHM, 'MP:', fwhm)
plt.plot(X, Y)
plt.plot(X, gauss(X, p1), lw=3, alpha=.5, color='r')
plt.axvspan(fit_mu - FWHM / 2, fit_mu + FWHM / 2, facecolor='g', alpha=0.5)
plt.plot(x1, y1, 'r.')
plt.plot(x2, y2, 'r.')
For noisy data (with a biased baseline), the results are not as consistent.
Hooked: 1.9903193212254346 MP: 1.5039676990530118
On the one hand the Gaussian fit is not very optimal for the data, but on the other hand, the strategy of picking the nearest point that intersects the half-max threshold is likely not optimal either.
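One way to reduce the baseline bias in the fitted width is to let the model carry a free amplitude and a constant baseline, so the noise floor is not absorbed into the Gaussian. The sketch below is only an illustration under that assumption: the function name `gauss_with_baseline` and the starting values are mine, and the data mimic the noisy example (uniform noise between 0 and 0.1) without the renormalization step.

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss_with_baseline(x, amp, mu, stdev, baseline):
    """Gaussian with free height on top of a constant baseline."""
    return amp * np.exp(-(x - mu) ** 2 / (2 * stdev ** 2)) + baseline

# Noisy sample data: unit-area Gaussian (mean 2.0, stdev 0.7) plus uniform noise
rng = np.random.default_rng(0)
X = np.linspace(-1.0, 5.0, 1000)
true_mu, true_stdev = 2.0, 0.7
Y = 1.0 / (true_stdev * np.sqrt(2 * np.pi)) * np.exp(-(X - true_mu) ** 2 / (2 * true_stdev ** 2))
Y += 0.10 * rng.random(X.size)  # lifts the baseline by roughly 0.05 on average

# Starting values read off the data
p0 = [Y.max() - Y.min(), X[np.argmax(Y)], 1.0, Y.min()]
popt, _ = curve_fit(gauss_with_baseline, X, Y, p0=p0)

fit_stdev = abs(popt[2])  # the sign of stdev is not identified by the model
fwhm = 2 * np.sqrt(2 * np.log(2)) * fit_stdev
print("FWHM", fwhm)
```

With the baseline fitted separately, the width estimate should no longer be stretched by the noise floor, although this still assumes the peak itself is Gaussian.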
My knowledge of maths is limited, which is probably why I am stuck. I have a spectrum to which I am trying to fit two Gaussian peaks. I can fit the largest peak, but I cannot fit the smallest peak. I understand that I need to sum the Gaussian functions for the two peaks, but I do not know where I have gone wrong. An image of my current output is shown:
The blue line is my data and the green line is my current fit. There is a shoulder to the left of the main peak in my data which I am currently trying to fit, using the following code:
import matplotlib.pyplot as pt
import numpy as np
from scipy.optimize import leastsq
from pylab import *
time = []
counts = []
for i in open('/some/folder/to/file.txt', 'r'):
    segs = i.split()
time_array = arange(len(time), dtype=float)
counts_array = arange(len(counts))
time_array[0:] = time
counts_array[0:] = counts
def model(time_array0, coeffs0):
    a = coeffs0[0] + coeffs0[1] * np.exp( - ((time_array0-coeffs0[2])/coeffs0[3])**2 )
    b = coeffs0[4] + coeffs0[5] * np.exp( - ((time_array0-coeffs0[6])/coeffs0[7])**2 )
    c = a+b
    return c
def residuals(coeffs, counts_array, time_array):
    return counts_array - model(time_array, coeffs)
# 0 = baseline, 1 = amplitude, 2 = centre, 3 = width
peak1 = np.array([0,6337,16.2,4.47,0,2300,13.5,2], dtype=float)
#peak2 = np.array([0,2300,13.5,2], dtype=float)
x, flag = leastsq(residuals, peak1, args=(counts_array, time_array))
#z, flag = leastsq(residuals, peak2, args=(counts_array, time_array))
plt.plot(time_array, counts_array)
plt.plot(time_array, model(time_array, x), color = 'g')
#plt.plot(time_array, model(time_array, z), color = 'r')
This code worked for me providing that you are only fitting a function that is a combination of two Gaussian distributions.
I just made a residuals function that adds two Gaussian functions and then subtracts them from the real data.
The parameters (p) that I passed to SciPy's least squares function (leastsq) include: the mean of the first Gaussian function (m), the difference between the means of the first and second Gaussian functions (dm, i.e. the horizontal shift), the standard deviation of the first (sd1), and the standard deviation of the second (sd2).
import numpy as np
from scipy.optimize import leastsq
import matplotlib.pyplot as plt
# Setting up test data
def norm(x, mean, sd):
    norm = []
    for i in range(x.size):
        norm += [1.0/(sd*np.sqrt(2*np.pi))*np.exp(-(x[i] - mean)**2/(2*sd**2))]
    return np.array(norm)
mean1, mean2 = 0, -2
std1, std2 = 0.5, 1
x = np.linspace(-20, 20, 500)
y_real = norm(x, mean1, std1) + norm(x, mean2, std2)
# Solving
m, dm, sd1, sd2 = [5, 10, 1, 1]
p = [m, dm, sd1, sd2] # Initial guesses for leastsq
y_init = norm(x, m, sd1) + norm(x, m + dm, sd2) # For final comparison plot
def res(p, y, x):
    m, dm, sd1, sd2 = p
    m1 = m
    m2 = m1 + dm
    y_fit = norm(x, m1, sd1) + norm(x, m2, sd2)
    err = y - y_fit
    return err
plsq = leastsq(res, p, args = (y_real, x))
y_est = norm(x, plsq[0][0], plsq[0][2]) + norm(x, plsq[0][0] + plsq[0][1], plsq[0][3])
plt.plot(x, y_real, label='Real Data')
plt.plot(x, y_init, 'r.', label='Starting Guess')
plt.plot(x, y_est, 'g.', label='Fitted')
You can use Gaussian mixture models from scikit-learn:
from sklearn import mixture
import matplotlib.pyplot as plt
from scipy.stats import norm
import numpy as np
# yourdata: 1-D array of samples; scikit-learn expects shape (n_samples, n_features)
clf = mixture.GaussianMixture(n_components=2, covariance_type='full')
clf.fit(np.asarray(yourdata).reshape(-1, 1))
m1, m2 = clf.means_.ravel()
w1, w2 = clf.weights_
c1, c2 = clf.covariances_.ravel()
histdist = plt.hist(yourdata, 100, density=True)
plotgauss1 = lambda x: plt.plot(x, w1*norm.pdf(x, m1, np.sqrt(c1)), linewidth=3)
plotgauss2 = lambda x: plt.plot(x, w2*norm.pdf(x, m2, np.sqrt(c2)), linewidth=3)
You can also use the function below to fit as many Gaussians as you want via the ncomp parameter:
from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
def fit_mixture(data, ncomp=2, doplot=False):
    clf = mixture.GaussianMixture(n_components=ncomp, covariance_type='full')
    clf.fit(np.asarray(data).reshape(-1, 1))
    ml = clf.means_
    wl = clf.weights_
    cl = clf.covariances_
    ms = [m[0] for m in ml]
    cs = [np.sqrt(c[0][0]) for c in cl]  # standard deviations
    ws = [w for w in wl]
    if doplot:
        histo = plt.hist(data, 200, density=True)
        for w, m, c in zip(ws, ms, cs):
            plt.plot(histo[1], w*norm.pdf(histo[1], m, c), linewidth=3)
    return ms, cs, ws
coeffs 0 and 4 are degenerate - there is absolutely nothing in the data that can decide between them. you should use a single zero level parameter instead of two (ie remove one of them from your code). this is probably what is stopping your fit (ignore the comments here saying this is not possible - there are clearly at least two peaks in that data and you should certainly be able to fit to that).
(it may not be clear why i am suggesting this, but what is happening is that coeffs 0 and 4 can cancel each other out. they can both be zero, or one could be 100 and the other -100 - either way, the fit is just as good. this "confuses" the fitting routine, which spends its time trying to work out what they should be, when there is no single right answer, because whatever value one is, the other can just be the negative of that, and the fit will be the same).
in fact, from the plot, it looks like there may be no need for a zero level at all. i would try dropping both of those and seeing how the fit looks.
also, there is no need to fit coeffs 1 and 5 (or the zero point) in the least squares. instead, because the model is linear in those you could calculate their values each loop. this will make things faster, but is not critical. i just noticed you say your maths is not so good, so probably ignore this one.
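For anyone curious what the last two points look like in practice, here is a minimal sketch under my own assumptions: a single shared baseline instead of two, and the baseline and amplitudes solved by linear least squares inside each residual evaluation, so that only the centres and widths are passed to the nonlinear optimiser. The function names, the synthetic data and the bounds are hypothetical and only there to make the example self-contained; the peak shape follows the exp(-((t - centre)/width)**2) convention used in the question.

```python
import numpy as np
from scipy.optimize import least_squares

def design_matrix(t, nonlinear):
    """Columns: constant baseline, then one unit-height peak per Gaussian."""
    c1, w1, c2, w2 = nonlinear
    return np.column_stack([
        np.ones_like(t),
        np.exp(-((t - c1) / w1) ** 2),
        np.exp(-((t - c2) / w2) ** 2),
    ])

def solve_linear(t, counts, nonlinear):
    """Best baseline and amplitudes for fixed centres and widths."""
    M = design_matrix(t, nonlinear)
    linear, *_ = np.linalg.lstsq(M, counts, rcond=None)
    return M, linear

def residuals(nonlinear, t, counts):
    # The linear parameters are recomputed on every call, not fitted iteratively
    M, linear = solve_linear(t, counts, nonlinear)
    return counts - M @ linear

# Synthetic, noise-free test data (hypothetical values)
t = np.linspace(0.0, 30.0, 300)
true_nonlinear = (10.0, 2.0, 18.0, 3.0)      # centre1, width1, centre2, width2
true_linear = np.array([5.0, 100.0, 60.0])   # baseline, amplitude1, amplitude2
counts = design_matrix(t, true_nonlinear) @ true_linear

start = [9.0, 1.5, 19.0, 3.5]
fit = least_squares(residuals, start, args=(t, counts),
                    bounds=([0.0, 0.1, 0.0, 0.1], [30.0, 30.0, 30.0, 30.0]))
_, linear = solve_linear(t, counts, fit.x)
print("centres/widths:", fit.x)
print("baseline/amplitudes:", linear)
```

The optimiser now searches over four parameters instead of eight, and the degenerate pair of baselines is gone because the model has only one constant column.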