I have two data sets where two values where measured. I am interested in the difference between the value and the standard deviation of the difference. I made a histogram which I would like to fit two normal distributions. To calculate the difference between the maxima. I also would like to evaluate the effect that in on data set I have much less data on one value. I've already looked at this link but it is not really what I need:
Python: finding the intersection point of two gaussian curves
for ii in range(2,8):
# Kanal = ii - 1
file = filepath + '\Mappe1.txt'
data = np.loadtxt(file, delimiter='\t', skiprows=1)
data = data[:,ii]
plt.hist(data,bins=100)
plt.xlabel("bins")
plt.ylabel("Counts")
plt.tight_layout()
plt.grid()
plt.figure()
plt.show()
Quick and dirty fitting can be readily achieved using scipy:
from scipy.optimize import curve_fit #non linear curve fitting tool
from matplotlib import pyplot as plt
def func2fit(x1,x2,m_1,m_2,std_1,std_2,height1, height2): #define a simple gauss curve
return height1*exp(-(x1-m_1)**2/2/std_1**2)+height2*exp(-(x2-m_2)**2/2/std_2**2)
init_guess=(-.3,.3,.5,.5,3000,3000)
#contains the initial guesses for the parameters (m_1, m_2, std_1, std_2, height1, height2) using your first figure
#do the fitting
fit_pars, pcov =curve_fit(func2fit,xdata,ydata,init_guess)
#fit_pars contains the mean, the heights and the SD values, pcov contains the estimated covariance of these parameters
plt.plot(xdata,func2fit(xdata,*fit_pars),label='fit') #plot the fit
For further reference consult the scipy manual page:
curve_fit
Assuming that the two samples are independent there is no need to handle this problem using curve fitting. It's basic statistics. Here's some code that does the calculations required, with the source attributed in a comment.
## adapted from http://onlinestatbook.com/2/estimation/difference_means.html
from random import gauss
from numpy import sqrt
sample_1 = [ gauss(0,1) for _ in range(10) ]
sample_2 = [ gauss(1,.5) for _ in range(20) ]
n_1 = len(sample_1)
n_2 = len(sample_2)
mean_1 = sum(sample_1)/n_1
mean_2 = sum(sample_2)/n_2
SSE = sum([(_-mean_1)**2 for _ in sample_1]) + sum([(_-mean_2)**2 for _ in sample_2])
df = (n_1-1) + (n_2-1)
MSE = SSE/df
n_h = 2 / ( 1/n_1 + 1/n_2 )
s_mean_diff = sqrt( 2* MSE / n_h )
print ( 'difference between means', abs(n_1-n_2))
print ( 'std dev of this difference', s_mean_diff )
Related
I have two arrays with x- and y- data.
This data shows lognormal behavior. I need a graph of the fit, as well as the mu and the sigma to do some statistics.
I did a fit, in order to calculate the mu, the sigma, and further on some statistical values of it. (See code below)
I obtain the scaling factor, with which I have to multiply the distribution with an integral over the datapoints.
The code below, does work. My question now is, if (I am sure) there is a better way to do this? It feels like a workaround, that will work sometimes. I want a better way to do this, because I have to plot hundreds of these.
My code (sorry, that it is this long, wanted to include everything except import of crude data):
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# produce plot True/False
ploton = True
x0=np.array([3.58381e+01, 3.27125e+01, 2.98680e+01, 2.72888e+01, 2.49364e+01,
2.27933e+01, 2.08366e+01, 1.90563e+01, 1.74380e+01, 1.59550e+01,
1.45904e+01, 1.33460e+01, 1.22096e+01, 1.11733e+01, 1.02262e+01,
9.35893e+00, 8.56556e+00, 7.86688e+00, 7.20265e+00, 6.59782e+00,
6.01571e+00, 5.53207e+00, 5.03979e+00, 4.64415e+00, 4.19920e+00,
3.83595e+00, 3.50393e+00, 3.28070e+00, 3.00930e+00, 2.75634e+00,
2.52050e+00, 2.31349e+00, 2.12280e+00, 1.92642e+00, 1.77820e+00,
1.61692e+00, 1.49094e+00, 1.36233e+00, 1.22935e+00, 1.14177e+00,
1.03078e+00, 9.39603e-01, 8.78425e-01, 1.01490e+00, 1.07461e-01,
4.81523e-02, 4.81523e-02, 1.00000e-02, 1.00000e-02])
y0=np.array([3.94604811e+04, 2.78223936e+04, 1.95979179e+04, 2.14447807e+04,
1.68677487e+04, 1.79429516e+04, 1.73589776e+04, 2.16101026e+04,
3.79705638e+04, 6.83622301e+04, 1.73687772e+05, 5.74854475e+05,
1.69497465e+06, 3.79135941e+06, 7.76757753e+06, 1.33429094e+07,
1.96096415e+07, 2.50403065e+07, 2.72818618e+07, 2.53120387e+07,
1.93102362e+07, 1.22219224e+07, 4.96725699e+06, 1.61174658e+06,
3.19352386e+05, 1.80305856e+05, 1.41728002e+05, 1.66191809e+05,
1.33223816e+05, 1.31384905e+05, 2.49100945e+05, 2.28300583e+05,
3.01063903e+05, 1.84271914e+05, 1.26412781e+05, 8.57488083e+04,
1.35536571e+05, 4.50076293e+04, 1.98080100e+05, 2.27630303e+05,
1.89484527e+05, 0.00000000e+00, 1.36543525e+05, 2.20677520e+05,
3.60100586e+05, 1.62676486e+05, 1.90105093e+04, 9.27461467e+05,
1.58373542e+05])
Dnm = x0
dndlndp = y0
#lognormal PDF:
def f(x, mu, sigma) :
return 1/(np.sqrt(2*np.pi)*sigma*x)*np.exp(-((np.log(x)-mu)**2)/(2*sigma**2))
#normalizing y-values to obtain lognormal distributed data:
y0_normalized = y0/np.trapz(x0.ravel(), y0.ravel())
#calculating mu/sigma of this distribution:
params, extras = curve_fit(f, x0.ravel(), y0_normalized.ravel())
median = np.exp(params[0])
mu = params[0]
sigma = params[1]
#output of mu / sigma / calculated median:
print "mu=%g, sigma=%g" % (params[0], params[1])
print "median=%g" % median
#new variable z for smooth fit-curve:
z = np.linspace(0.1, 100, 10000)
#######################
Dnm = np.ravel(Dnm)
dndlndp = np.ravel(dndlndp)
Dnm_rev = list(reversed(Dnm))
dndlndp_rev = list(reversed(dndlndp))
scalingfactor = np.trapz(dndlndp_rev, Dnm_rev, dx = np.log(Dnm_rev))
#####################
#plotting
if ploton:
plt.plot(z, f(z, mu, sigma)*scalingfactor, label="fit", color = "red")
plt.scatter(x0, y0, label="data")
plt.xlim(3,20)
plt.xscale("log")
plt.legend()
EDIT1: Maybe I should add that I have no idea, why the scaling factor calculated with
scalingfactor = np.trapz(dndlndp_rev, Dnm_rev, dx = np.log(Dnm_rev))
is right. It was simply try and error. I really want to know, why this does the trick, since the "area" of all bins combined is:
N = np.trapz(dndlndp_rev, np.log(Dnm_rev), dx = np.log(Dnm_rev))
because the width of the bins is log(Dnm).
EDIT2: Thank you for all answers. I copied the arrays into the code, which is now runable. I want to simplify the question, since i think, due to my poor english, i was not able to say what i really want:
I have lognormal set of data. The code above allows me to calculate the mu and the sigma. To do so, i need to normalize the data, and the area under the function is from now on = 1.
In order to plot a lognormal function with the calculated mu and sigma, i need to multiply the function with an (unknown) factor, because the area under the real function is something like 1e8, but sure not one. I did a workaround by calculating this "scalingfactor" via the trapz integral of the diskrete crude data.
There has to be a better way to plot the fitted function, when mu and sigma are already known.
I want to get kernel density estimation for positive data points. Using Python Scipy Stats package, I came up with the following code.
def get_pdf(data):
a = np.array(data)
ag = st.gaussian_kde(a)
x = np.linspace(0, max(data), max(data))
y = ag(x)
return x, y
This works perfectly for most data sets, but it gives an erroneous result for "all positive" data points. To make sure this works correctly, I use numerical integration to compute the area under this curve.
def trapezoidal_2(ag, a, b, n):
h = np.float(b - a) / n
s = 0.0
s += ag(a)[0]/2.0
for i in range(1, n):
s += ag(a + i*h)[0]
s += ag(b)[0]/2.0
return s * h
Since the data is spread in the region (0, int(max(data))), we should get a value close to 1, when executing the following line.
b = 1
data = st.pareto.rvs(b, size=10000)
data = list(data)
a = np.array(data)
ag = st.gaussian_kde(a)
trapezoidal_2(ag, 0, int(max(data)), int(max(data))*2)
But it gives a value close to 0.5 when I test.
But when I intergrate from -100 to max(data), it provides a value close to 1.
trapezoidal_2(ag, -100, int(max(data)), int(max(data))*2+200)
The reason is, ag (KDE) is defined for values less than 0, even though the original data set contains only positive values.
So how can I get a kernel density estimation that considers only positive values, such that area under the curve in the region (o, max(data)) is close to 1?
The choice of the bandwidth is quite important when performing kernel density estimation. I think the Scott's Rule and Silverman's Rule work well for distribution similar to a Gaussian. However, they do not work well for the Pareto distribution.
Quote from the doc:
Bandwidth selection strongly influences the estimate obtained from
the KDE (much more so than the actual shape of the kernel). Bandwidth selection
can be done by a "rule of thumb", by cross-validation, by "plug-in
methods" or by other means; see [3], [4] for reviews. gaussian_kde
uses a rule of thumb, the default is Scott's Rule.
Try with different bandwidth values, for example:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
b = 1
sample = stats.pareto.rvs(b, size=3000)
kde_sample_scott = stats.gaussian_kde(sample, bw_method='scott')
kde_sample_scalar = stats.gaussian_kde(sample, bw_method=1e-3)
# Compute the integrale:
print('integrale scott:', kde_sample_scott.integrate_box_1d(0, np.inf))
print('integrale scalar:', kde_sample_scalar.integrate_box_1d(0, np.inf))
# Graph:
x_span = np.logspace(-2, 1, 550)
plt.plot(x_span, stats.pareto.pdf(x_span, b), label='theoretical pdf')
plt.plot(x_span, kde_sample_scott(x_span), label="estimated pdf 'scott'")
plt.plot(x_span, kde_sample_scalar(x_span), label="estimated pdf 'scalar'")
plt.xlabel('X'); plt.legend();
gives:
integrale scott: 0.5572130540733236
integrale scalar: 0.9999999999968957
and:
We see that the kde using the Scott method is wrong.
I am completely new to pymc3, so please excuse the fact that this is likely trivial. I have a very simple model where I am predicting a binary response function. The model is almost a verbatim copy of this example: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/gelman_bioassay.py
I get back the model parameters (alpha, beta, and theta), but I can't seem to figure out how to overplot the predictions of the model vs. the input data. I tried doing this (using the parlance of the bioassay model):
from scipy.stats import binom
mean_alpha = mean(trace['alpha'])
mean_beta = mean(trace['beta'])
pred_death = binom.rvs(n, 1./(1.+np.exp(-(mean_alpha + mean_beta * dose))))
and then plotting dose vs. pred_death, but this is manifestly not correct as I get different draws of the binomial distribution every time.
Related to this is another question, how do I evaluate the goodness of fit? I couldn't seem to find anything to that effect in the "getting started" pymc3 tutorial.
Thanks very much for any advice!
Hi a simple way to do it is as follows:
from pymc3 import *
from numpy import ones, array
# Samples for each dose level
n = 5 * ones(4, dtype=int)
# Log-dose
dose = array([-.86, -.3, -.05, .73])
def invlogit(x):
return np.exp(x) / (1 + np.exp(x))
with Model() as model:
# Logit-linear model parameters
alpha = Normal('alpha', 0, 0.01)
beta = Normal('beta', 0, 0.01)
# Calculate probabilities of death
theta = Deterministic('theta', invlogit(alpha + beta * dose))
# Data likelihood
deaths = Binomial('deaths', n=n, p=theta, observed=[0, 1, 3, 5])
start = find_MAP()
step = NUTS(scaling=start)
trace = sample(2000, step, start=start, progressbar=True)
import matplotlib.pyplot as plt
death_fit = np.percentile(trace.theta,50,axis=0)
plt.plot(dose, death_fit,'g', marker='.', lw='1.25', ls='-', ms=5, mew=1)
plt.show()
If you want to plot dose vs pred_death, where pred_death is computed from the mean estimated values of alpha and beta, then do:
pred_death = 1./(1. + np.exp(-(mean_alpha + mean_beta * dose)))
plt.plot(dose, pred_death)
instead if you want to plot dose vs pred_death, where pred_death is computed taking into account the uncertainty in posterior for alpha and beta. Then probably the easiest way is to use the function sample_ppc:
May be something like
ppc = pm.sample_ppc(trace, samples=100, model=pmmodel)
for i in range(100):
plt.plot(dose, ppc['deaths'][i], 'bo', alpha=0.5)
Using Posterior Predictive Checks (ppc) is a way to check how well your model behaves by comparing the predictions of the model to your actual data. Here you have an example of sample_ppc
Other options could be to plot the mean value plus some interval of interest.
I'm trying to fit several lines sharing the same intercept.
import numpy as np
import pymc
# Observations
a_actual = np.array([[2., 5., 7.]]).T
b_actual = 3.
t = np.arange(100)
obs = np.random.normal(a_actual * t + b_actual)
# PyMC Model
def model_linear():
b = pymc.Uniform('b', value=1., lower=0, upper=200)
a = []
s = []
r = []
for i in range(len(a_actual)):
s.append(pymc.Uniform('sigma_{}'.format(i), value=1., lower=0, upper=100))
a.append(pymc.Uniform('a_{}'.format(i), value=1., lower=0, upper=200))
r.append(pymc.Normal('r_{}'.format(i), mu=a[i] * t + b, tau=1/s[i]**2, value=obs[i], observed=True))
return [pymc.Container(a), b, pymc.Container(s), pymc.Container(r)]
model = pymc.Model(model_linear())
map = pymc.MAP(model)
map.fit()
map.revert_to_max()
The computed MAP estimates are far from the actual values. Those values are also very sensitive to the lower and upper bounds of sigmas and a, to the actual values of a (e.g. a = [.2, .5, .7] will give me good estimates) or to the number of lines to do the regression on.
Is this the right way of performing my linear regressions?
ps : I tried to use an Exponential prior distribution for sigmas but results were not better.
I think using MAP might not be your best bet. If you are able to do a proper sampling then consider replacing the last 3 lines of your example code with
MCMClinear = pymc.MCMC( model)
MCMClinear.sample(10000,burn=5000,thin=5)
linear_output=MCMClinear.stats()
Printing the linear_output for this gives very accurate inferences for the parameters.
I'm having trouble getting scipy.interpolate.UnivariateSpline to use any smoothing when interpolating. Based on the function's page as well as some previous posts, I believe it should provide smoothing with the s parameter.
Here is my code:
# Imports
import scipy
import pylab
# Set up and plot actual data
x = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
y = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
pylab.plot(x, y, "o", label="Actual")
# Plot estimates using splines with a range of degrees
for k in range(1, 4):
mySpline = scipy.interpolate.UnivariateSpline(x=x, y=y, k=k, s=2)
xi = range(0, 15100, 20)
yi = mySpline(xi)
pylab.plot(xi, yi, label="Predicted k=%d" % k)
# Show the plot
pylab.grid(True)
pylab.xticks(rotation=45)
pylab.legend( loc="lower right" )
pylab.show()
Here is the result:
I have tried this with a range of s values (0.01, 0.1, 1, 2, 5, 50), as well as explicit weights, set to either the same thing (1.0) or randomized. I still can't get any smoothing, and the number of knots is always the same as the number of data points. In particular, I'm looking for outliers like that 4th point (7990.4664106277542, 5851.6866463790966) to be smoothed over.
Is it because I don't have enough data? If so, is there a similar spline function or cluster technique I can apply to achieve smoothing with this few datapoints?
Short answer: you need to choose the value for s more carefully.
The documentation for UnivariateSpline states that:
Positive smoothing factor used to choose the number of knots. Number of
knots will be increased until the smoothing condition is satisfied:
sum((w[i]*(y[i]-s(x[i])))**2,axis=0) <= s
From this one can deduce that "reasonable" values for smoothing, if you don't pass in explicit weights, are around s = m * v where m is the number of data points and v the variance of the data. In this case, s_good ~ 5e7.
EDIT: sensible values for s depend of course also on the noise level in the data. The docs seem to recommend choosing s in the range (m - sqrt(2*m)) * std**2 <= s <= (m + sqrt(2*m)) * std**2 where std is the standard deviation associated with the "noise" you want to smooth over.
#Zhenya's answer of manually setting knots in between datapoints was too rough to deliver good results in noisy data without being selective about how this technique is applied. However, inspired by his/her suggestion, I have had success with Mean-Shift clustering from the scikit-learn package. It performs auto-determination of the cluster count and seems to do a fairly good smoothing job (very smooth in fact).
# Imports
import numpy
import pylab
import scipy
import sklearn.cluster
# Set up original data - note that it's monotonically increasing by X value!
data = {}
data['original'] = {}
data['original']['x'] = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
data['original']['y'] = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
# Cluster data, sort it and and save
inputNumpy = numpy.array([[data['original']['x'][i], data['original']['y'][i]] for i in range(0, len(data['original']['x']))])
meanShift = sklearn.cluster.MeanShift()
meanShift.fit(inputNumpy)
clusteredData = [[pair[0], pair[1]] for pair in meanShift.cluster_centers_]
clusteredData.sort(lambda pair1, pair2: cmp(pair1[0],pair2[0]))
data['clustered'] = {}
data['clustered']['x'] = [pair[0] for pair in clusteredData]
data['clustered']['y'] = [pair[1] for pair in clusteredData]
# Build a spline using the clustered data and predict
mySpline = scipy.interpolate.UnivariateSpline(x=data['clustered']['x'], y=data['clustered']['y'], k=1)
xi = range(0, round(max(data['original']['x']), -3) + 3000, 20)
yi = mySpline(xi)
# Plot the datapoints
pylab.plot(data['clustered']['x'], data['clustered']['y'], "D", label="Datapoints (%s)" % 'clustered')
pylab.plot(xi, yi, label="Predicted (%s)" % 'clustered')
pylab.plot(data['original']['x'], data['original']['y'], "o", label="Datapoints (%s)" % 'original')
# Show the plot
pylab.grid(True)
pylab.xticks(rotation=45)
pylab.legend( loc="lower right" )
pylab.show()
While I'm not aware of any library which will do it for you off-hand, I'd try a bit more DIY approach: I'd start from making a spline with knots in between the raw data points, in both x and y. In your particular example, having a single knot in between the 4th and 5th points should do the trick, since it'd remove the huge derivative at around x=8000.
I had trouble getting BigChef's answer running, here is a variation that works on python 3.6:
# Imports
import pylab
import scipy
import sklearn.cluster
# Set up original data - note that it's monotonically increasing by X value!
data = {}
data['original'] = {}
data['original']['x'] = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
data['original']['y'] = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
# Cluster data, sort it and and save
import numpy
inputNumpy = numpy.array([[data['original']['x'][i], data['original']['y'][i]] for i in range(0, len(data['original']['x']))])
meanShift = sklearn.cluster.MeanShift()
meanShift.fit(inputNumpy)
clusteredData = [[pair[0], pair[1]] for pair in meanShift.cluster_centers_]
clusteredData.sort(key=lambda li: li[0])
data['clustered'] = {}
data['clustered']['x'] = [pair[0] for pair in clusteredData]
data['clustered']['y'] = [pair[1] for pair in clusteredData]
# Build a spline using the clustered data and predict
mySpline = scipy.interpolate.UnivariateSpline(x=data['clustered']['x'], y=data['clustered']['y'], k=1)
xi = range(0, int(round(max(data['original']['x']), -3)) + 3000, 20)
yi = mySpline(xi)
# Plot the datapoints
pylab.plot(data['clustered']['x'], data['clustered']['y'], "D", label="Datapoints (%s)" % 'clustered')
pylab.plot(xi, yi, label="Predicted (%s)" % 'clustered')
pylab.plot(data['original']['x'], data['original']['y'], "o", label="Datapoints (%s)" % 'original')
# Show the plot
pylab.grid(True)
pylab.xticks(rotation=45)
pylab.show()