Wrong P_value given by ttest_1samp - python

Here is a one sample t-test example:
from scipy.stats import ttest_1samp
import numpy as np
ages = [32., 34., 29., 29., 22., 39., 38., 37.,38, 36, 30, 26, 22, 22.]
ages_mean = np.mean(ages)
ages_std = np.std(ages, ddof=1)
print(ages_mean)
print(ages_std)
ttest, pval = ttest_1samp(ages, 30)
print("ttest: ", ttest)
print("p_value: ", pval)
#31.0
#6.2634470725607025
#ttest: 0.5973799001456603
#p_value: 0.5605155888171379
# check analytically:
my_ttest = (ages_mean - 30.0)/(ages_std/np.sqrt(len(ages)))
print(my_ttest)
#0.5973799001456603
Now check the p-value. By definition, p_value = P(t >= 0.597) = 1 - P(t <= 0.597).
Using the Z-table, I get p_value = 1 - 0.7224 = 0.2776, not the 0.5605 printed above. Why the difference?

If you check the documentation of ttest_1samp, you will see that it returns a two-sided p-value: the probability of observing an absolute t-statistic at least as extreme as the one computed, under the null hypothesis.
The t distribution is symmetric, so we can evaluate the CDF at -abs(t statistic) and multiply by two to get the two-sided p-value:
from scipy.stats import t
2*t.cdf(-0.5973799001456603, 13)
0.5605155888171379
Your derived value will be correct for a one-sided t-test :)
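If you actually want the one-sided p-value, you can simply halve the two-sided one (the t distribution is symmetric and the statistic here is positive), or pass the alternative keyword; note that this keyword only exists in SciPy 1.6 and later:
tstat, pval_two_sided = ttest_1samp(ages, 30)
print(pval_two_sided / 2)                              # ~0.28, close to the Z-table hand calculation
tstat, pval_one_sided = ttest_1samp(ages, 30, alternative='greater')
print(pval_one_sided)                                  # same ~0.28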

Inverse problem with scipy.optimize.differential_evolution

I'm writing code to fit y = a*x + b to given values of x and y, i.e. to determine a and b.
For example, for y = 2x + 1 I have
xexp =[ 0., 0.1111111, 0.2222222, 0.3333333, 0.4444444, 0.5555556, 0.6666667, 0.7777778, 0.8888889, 1.] and
yexp=[1., 1.2222222, 1.4444444, 1.6666667, 1.8888889, 2.1111111, 2.3333333, 2.5555556, 2.7777778, 3.]
I need a loop in my code that calculates every ycalc[i] and returns, as my objective function, the mean error over all the given data.
from scipy.optimize import differential_evolution
import numpy as np
from matplotlib import pyplot
from scipy.optimize import LinearConstraint
from scipy.optimize import NonlinearConstraint
#2*x+1 // a*x+b --> have xexp and yexp need to calc a and b when xcalc = xexp
errototal=[]
ycalc=[]
erro=[]
#experimental values
xexp =[ 0., 0.1111111, 0.2222222, 0.3333333, 0.4444444, 0.5555556, 0.6666667, 0.7777778, 0.8888889, 1.]
yexp=[1., 1.2222222, 1.4444444, 1.6666667, 1.8888889, 2.1111111, 2.3333333, 2.5555556, 2.7777778, 3.]
#obj function
def func(xcalc):
    return np.mean(sum(errototal))

bounds = [(-5, 5), (-5, 5)]
plot1=[]

def condition(xk, convergence):
    for i in range(0,9):
        ycalc.append(xk[0]*xexp[i]+xk[1])
        # ycalc.append(xk[0]*xk[2]+xk[1])
        erro.append(abs(yexp[i]-ycalc[i]))
        errototal.append(erro[i])
        plot1.append(func(sum(errototal)))
    print(xk,errototal,ycalc, yexp)
    print(sum(errototal))
result = differential_evolution(func, bounds, disp=True, callback=condition)
# line plot of best objective function values
pyplot.plot(plot1, '.-')
pyplot.xlabel('Improvement Number')
pyplot.ylabel('Evaluation f(x)')
pyplot.show()
My code stops in the first step. Any tips?
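For reference, differential_evolution expects the objective function to take the candidate parameter vector and return a single scalar, with no global bookkeeping lists needed; a minimal sketch along those lines (the names objective and result are illustrative, not from the original post):
import numpy as np
from scipy.optimize import differential_evolution

xexp = np.array([0., 0.1111111, 0.2222222, 0.3333333, 0.4444444,
                 0.5555556, 0.6666667, 0.7777778, 0.8888889, 1.])
yexp = np.array([1., 1.2222222, 1.4444444, 1.6666667, 1.8888889,
                 2.1111111, 2.3333333, 2.5555556, 2.7777778, 3.])

def objective(params):
    a, b = params
    ycalc = a * xexp + b                  # model prediction for every data point
    return np.mean(np.abs(yexp - ycalc))  # mean absolute error as the scalar objective

bounds = [(-5, 5), (-5, 5)]
result = differential_evolution(objective, bounds, disp=True)
print(result.x)  # should come out close to [2, 1]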

Different values weibull pdf

I was wondering why the values of the Weibull PDF from the prebuilt function dweibull.pdf are roughly half of what they should be.
I ran a test: for the same x I created the Weibull PDF for A=10 and K=2 twice, once by writing the formula myself and once with the prebuilt dweibull function.
import numpy as np
from scipy.stats import exponweib,dweibull
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
K=2.0
A=10.0
x=np.arange(0.,20.,1)
#own function
def weib(data,a,k):
    return (k / a) * (data / a)**(k - 1) * np.exp(-(data / a)**k)
pdf1=weib(x,A,K)
print(sum(pdf1))
#prebuilt function
dist=dweibull(K,1,A)
pdf2=dist.pdf(x)
print(sum(pdf2))
f=plt.figure()
suba=f.add_subplot(121)
suba.plot(x,pdf1)
suba.set_title('pdf own function')
subb=f.add_subplot(122)
subb.plot(x,pdf2)
subb.set_title('pdf dweibull')
f.show()
It seems that with dweibull the PDF values are about half of what they should be, which looks wrong since the summation should be about 1 rather than around 0.5, as it is with dweibull. With my own formula the summation is around 1.
scipy.stats.dweibull implements the double Weibull distribution. Its support is the whole real line, so its density on the positive half-axis is half that of the one-sided Weibull, which is exactly the factor of two you are seeing. Your function weib corresponds to the PDF of scipy's weibull_min distribution.
Compare your function weib to weibull_min.pdf:
In [128]: from scipy.stats import weibull_min
In [129]: x = np.arange(0, 20, 1.0)
In [130]: K = 2.0
In [131]: A = 10.0
Your implementation:
In [132]: weib(x, A, K)
Out[132]:
array([ 0. , 0.019801 , 0.03843158, 0.05483587, 0.0681715 ,
0.07788008, 0.08372116, 0.0857677 , 0.08436679, 0.08007445,
0.07357589, 0.0656034 , 0.05686266, 0.04797508, 0.03944036,
0.03161977, 0.02473752, 0.01889591, 0.014099 , 0.0102797 ])
scipy.stats.weibull_min.pdf:
In [133]: weibull_min.pdf(x, K, scale=A)
Out[133]:
array([ 0. , 0.019801 , 0.03843158, 0.05483587, 0.0681715 ,
0.07788008, 0.08372116, 0.0857677 , 0.08436679, 0.08007445,
0.07357589, 0.0656034 , 0.05686266, 0.04797508, 0.03944036,
0.03161977, 0.02473752, 0.01889591, 0.014099 , 0.0102797 ])
By the way, there is a mistake in this line of your code:
dist=dweibull(K,1,A)
The order of the parameters is shape, location, scale, so you are setting the location parameter to 1. That's why the values in your second plot are shifted by one. That line should have been
dist = dweibull(K, 0, A)
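To see the factor of two explicitly, here is a quick check (a sketch using only the distributions discussed above):
import numpy as np
from scipy.stats import dweibull, weibull_min

K, A = 2.0, 10.0
x = np.arange(1.0, 20.0)  # positive x only

# dweibull spreads its mass over the whole real line, so for x > 0
# its density is half that of weibull_min with the same shape and scale.
print(np.allclose(2 * dweibull.pdf(x, K, 0, A), weibull_min.pdf(x, K, scale=A)))  # True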

Fast b-spline algorithm with numpy/scipy

I need to compute B-spline curves in Python. I looked into scipy.interpolate.splprep and a few other scipy modules but couldn't find anything that readily gave me what I needed, so I wrote my own module below. The code works fine, but it is slow (the test function runs in 0.03 s, which seems like a lot considering I'm only asking for 100 samples with 6 control vertices).
Is there a way to simplify the code below with a few scipy module calls, which presumably would speed it up? And if not, what could I do to my code to improve its performance?
import numpy as np
# cv = np.array of 3d control vertices
# n = number of samples (default: 100)
# d = curve degree (default: cubic)
# closed = is the curve closed (periodic) or open? (default: open)
def bspline(cv, n=100, d=3, closed=False):

    # Create a range of u values
    count = len(cv)
    knots = None
    u = None
    if not closed:
        u = np.arange(0,n,dtype='float')/(n-1) * (count-d)
        knots = np.array([0]*d + list(range(count-d+1)) + [count-d]*d,dtype='int')
    else:
        u = ((np.arange(0,n,dtype='float')/(n-1) * count) - (0.5 * (d-1))) % count # keep u=0 relative to 1st cv
        knots = np.arange(0-d,count+d+d-1,dtype='int')

    # Simple Cox - DeBoor recursion
    def coxDeBoor(u, k, d):
        # Test for end conditions
        if (d == 0):
            if (knots[k] <= u and u < knots[k+1]):
                return 1
            return 0

        Den1 = knots[k+d] - knots[k]
        Den2 = knots[k+d+1] - knots[k+1]
        Eq1 = 0
        Eq2 = 0

        if Den1 > 0:
            Eq1 = ((u-knots[k]) / Den1) * coxDeBoor(u,k,(d-1))
        if Den2 > 0:
            Eq2 = ((knots[k+d+1]-u) / Den2) * coxDeBoor(u,(k+1),(d-1))

        return Eq1 + Eq2

    # Sample the curve at each u value
    samples = np.zeros((n,3))
    for i in range(n):
        if not closed:
            if u[i] == count-d:
                samples[i] = np.array(cv[-1])
            else:
                for k in range(count):
                    samples[i] += coxDeBoor(u[i],k,d) * cv[k]
        else:
            for k in range(count+d):
                samples[i] += coxDeBoor(u[i],k,d) * cv[k%count]

    return samples


if __name__ == "__main__":
    import matplotlib.pyplot as plt

    def test(closed):
        cv = np.array([[ 50.,  25.,  -0.],
                       [ 59.,  12.,  -0.],
                       [ 50.,  10.,   0.],
                       [ 57.,   2.,   0.],
                       [ 40.,   4.,   0.],
                       [ 40.,  14.,  -0.]])

        p = bspline(cv,closed=closed)
        x,y,z = p.T
        cv = cv.T
        plt.plot(cv[0],cv[1], 'o-', label='Control Points')
        plt.plot(x,y,'k-',label='Curve')
        plt.minorticks_on()
        plt.legend()
        plt.xlabel('x')
        plt.ylabel('y')
        plt.xlim(35, 70)
        plt.ylim(0, 30)
        plt.gca().set_aspect('equal', adjustable='box')
        plt.show()

    test(False)
The two images below show what my code returns for both the open and closed conditions:
So after obsessing a lot over my question, and much research, I finally have my answer. Everything is available in scipy, and I'm putting my code here so hopefully someone else can find this useful.
The function takes an array of N-d points, a curve degree, and a periodic state (open or closed), and will return n samples along that curve. There are ways to make sure the curve samples are equidistant, but for the time being I'll focus on this question, as it is all about speed.
Worth noting: I can't seem to go beyond a curve of 20th degree. Granted, that's overkill already, but I figured it's worth mentioning.
Also worth noting: on my machine the code below can calculate 100,000 samples in 0.017 s.
import numpy as np
import scipy.interpolate as si
def bspline(cv, n=100, degree=3, periodic=False):
    """ Calculate n samples on a bspline

        cv :      Array of control vertices
        n  :      Number of samples to return
        degree:   Curve degree
        periodic: True - Curve is closed
                  False - Curve is open
    """

    # If periodic, extend the point array by count+degree+1
    cv = np.asarray(cv)
    count = len(cv)

    if periodic:
        factor, fraction = divmod(count+degree+1, count)
        cv = np.concatenate((cv,) * factor + (cv[:fraction],))
        count = len(cv)
        degree = np.clip(degree,1,degree)

    # If opened, prevent degree from exceeding count-1
    else:
        degree = np.clip(degree,1,count-1)

    # Calculate knot vector
    kv = None
    if periodic:
        kv = np.arange(0-degree,count+degree+degree-1)
    else:
        kv = np.clip(np.arange(count+degree+1)-degree,0,count-degree)

    # Calculate query range
    u = np.linspace(periodic,(count-degree),n)

    # Calculate result
    return np.array(si.splev(u, (kv,cv.T,degree))).T
To test it:
import matplotlib.pyplot as plt

colors = ('b', 'g', 'r', 'c', 'm', 'y', 'k')

cv = np.array([[ 50.,  25.],
               [ 59.,  12.],
               [ 50.,  10.],
               [ 57.,   2.],
               [ 40.,   4.],
               [ 40.,  14.]])

plt.plot(cv[:,0],cv[:,1], 'o-', label='Control Points')

for d in range(1,21):
    p = bspline(cv,n=100,degree=d,periodic=True)
    x,y = p.T
    plt.plot(x,y,'k-',label='Degree %s'%d,color=colors[d%len(colors)])

plt.minorticks_on()
plt.legend()
plt.xlabel('x')
plt.ylabel('y')
plt.xlim(35, 70)
plt.ylim(0, 30)
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
Results for both open and periodic curves:
ADDENDUM
As of scipy-0.19.0 there is a new scipy.interpolate.BSpline function that can be used.
import numpy as np
import scipy.interpolate as si
def scipy_bspline(cv, n=100, degree=3, periodic=False):
    """ Calculate n samples on a bspline

        cv :      Array of control vertices
        n  :      Number of samples to return
        degree:   Curve degree
        periodic: True - Curve is closed
                  False - Curve is open
    """
    cv = np.asarray(cv)
    count = cv.shape[0]

    # Closed curve
    if periodic:
        kv = np.arange(-degree,count+degree+1)
        factor, fraction = divmod(count+degree+1, count)
        cv = np.roll(np.concatenate((cv,) * factor + (cv[:fraction],)),-1,axis=0)
        degree = np.clip(degree,1,degree)

    # Opened curve
    else:
        degree = np.clip(degree,1,count-1)
        kv = np.clip(np.arange(count+degree+1)-degree,0,count-degree)

    # Return samples
    max_param = count - (degree * (1-periodic))
    spl = si.BSpline(kv, cv, degree)
    return spl(np.linspace(0,max_param,n))
Testing for equivalency:
p1 = bspline(cv,n=10**6,degree=3,periodic=True) # 1 million samples: 0.0882 sec
p2 = scipy_bspline(cv,n=10**6,degree=3,periodic=True) # 1 million samples: 0.0789 sec
print(np.allclose(p1,p2)) # returns True
Giving optimization tips without profiling data is a bit like shooting in the dark. However, the function coxDeBoor seems to be called very often. This is where I would start optimizing.
Function calls in Python are expensive. You should try to replace the coxDeBoor recursion with iteration to avoid excessive function calls. Some general information on how to do this can be found in answers to this question. As a stack/queue you can use collections.deque.
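Short of a full rewrite as iteration, one cheap option in the same spirit is to memoize the basis function for a fixed knot vector, so repeated (u, k, d) evaluations are not recomputed. A rough sketch, not taken from the code above (the factory name bspline_basis is made up):
from functools import lru_cache
import numpy as np

def bspline_basis(knots):
    """Return a memoized Cox-de Boor basis function for a fixed knot vector."""
    knots = np.asarray(knots)

    @lru_cache(maxsize=None)
    def basis(u, k, d):
        if d == 0:
            # Degree 0: indicator of the half-open knot span [knots[k], knots[k+1]).
            return 1.0 if knots[k] <= u < knots[k + 1] else 0.0
        den1 = knots[k + d] - knots[k]
        den2 = knots[k + d + 1] - knots[k + 1]
        eq1 = ((u - knots[k]) / den1) * basis(u, k, d - 1) if den1 > 0 else 0.0
        eq2 = ((knots[k + d + 1] - u) / den2) * basis(u, k + 1, d - 1) if den2 > 0 else 0.0
        return eq1 + eq2

    return basis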

Fitting negative binomial in python

In scipy there is no support for fitting a negative binomial distribution using data
(maybe due to the fact that the negative binomial in scipy is only discrete).
For a normal distribution I would just do:
from scipy.stats import norm
param = norm.fit(samp)
Is there a similar 'ready to use' function in any other library?
Statsmodels has discrete.discrete_model.NegativeBinomial.fit(), see here:
https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.NegativeBinomial.fit.html#statsmodels.discrete.discrete_model.NegativeBinomial.fit
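For reference, a minimal sketch of how that could be used on a plain sample, by fitting an intercept-only model (the simulated data and variable names here are purely illustrative):
import numpy as np
import statsmodels.api as sm
from scipy.stats import nbinom

sample = nbinom.rvs(n=5, p=0.5, size=5000, random_state=0)  # simulated counts
exog = np.ones((len(sample), 1))                            # intercept-only design
res = sm.NegativeBinomial(sample, exog).fit(disp=0)
print(res.params)  # [intercept, alpha]; the fitted mean is exp(intercept)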
Not only because it is discrete, but also because a maximum likelihood fit to the negative binomial can be quite involved, especially with an additional location parameter. That is probably why a .fit() method is not provided for it (and for the other discrete distributions in SciPy). Here is an example:
In [163]:
import scipy.stats as ss
import scipy.optimize as so
In [164]:
#define a likelihood function
def likelihood_f(P, x, neg=1):
    n=np.round(P[0]) #by definition, it should be an integer
    p=P[1]
    loc=np.round(P[2])
    return neg*(np.log(ss.nbinom.pmf(x, n, p, loc))).sum()
In [165]:
#generate a random variable
X=ss.nbinom.rvs(n=100, p=0.4, loc=0, size=1000)
In [166]:
#The likelihood
likelihood_f([100,0.4,0], X)
Out[166]:
-4400.3696690513316
In [167]:
#A simple fit, the fit is not good and the parameter estimate is way off
result=so.fmin(likelihood_f, [50, 1, 1], args=(X,-1), full_output=True, disp=False)
P1=result[0]
(result[1], result[0])
Out[167]:
(4418.599495886474, array([ 59.61196161, 0.28650831, 1.15141838]))
In [168]:
#Try a different set of start parameters, the fit is still not good and the parameter estimate is still way off
result=so.fmin(likelihood_f, [50, 0.5, 0], args=(X,-1), full_output=True, disp=False)
P1=result[0]
(result[1], result[0])
Out[168]:
(4417.1495981801972,
array([ 6.24809397e+01, 2.91877405e-01, 6.63343536e-04]))
In [169]:
#In this case we need a loop to get it right
result=[]
for i in range(40, 120): #in fact (80, 120) should probably be enough
_=so.fmin(likelihood_f, [i, 0.5, 0], args=(X,-1), full_output=True, disp=False)
result.append((_[1], _[0]))
In [170]:
#get the MLE
P2=sorted(result, key=lambda x: x[0])[0][1]
sorted(result, key=lambda x: x[0])[0]
Out[170]:
(4399.780263084549,
array([ 9.37289361e+01, 3.84587087e-01, 3.36856705e-04]))
In [171]:
#Which one is visually better?
plt.hist(X, bins=20, normed=True)
plt.plot(range(260), ss.nbinom.pmf(range(260), np.round(P1[0]), P1[1], np.round(P1[2])), 'g-')
plt.plot(range(260), ss.nbinom.pmf(range(260), np.round(P2[0]), P2[1], np.round(P2[2])), 'r-')
Out[171]:
[<matplotlib.lines.Line2D at 0x109776c10>]
I know this thread is quite old, but current readers may want to look at this repo which is made for this purpose: https://github.com/gokceneraslan/fit_nbinom
There's also an implementation here, though part of a larger package: https://github.com/ernstlab/ChromTime/blob/master/optimize.py
I stumbled across this thread and worked out an answer for anyone else wondering.
If you simply need the n, p parameterisation used by scipy.stats.nbinom, you can convert the mean and variance estimates:
mu = np.mean(sample)
sigma_sqr = np.var(sample)
n = mu**2 / (sigma_sqr - mu)
p = mu / sigma_sqr
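A quick sanity check of those moment-based estimates against a sample with known parameters (the simulated sample below is just for illustration):
import numpy as np
from scipy.stats import nbinom

sample = nbinom.rvs(n=10, p=0.3, size=100000, random_state=1)

mu = np.mean(sample)
sigma_sqr = np.var(sample)
print(mu**2 / (sigma_sqr - mu), mu / sigma_sqr)  # should land near n=10 and p=0.3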
If you need the dispersion parameter, you can use a negative binomial regression model from statsmodels with just an intercept term. This will find the dispersion parameter alpha using MLE.
# Data processing
import pandas as pd
import numpy as np
# Analysis models
import statsmodels.formula.api as smf
from scipy.stats import nbinom
def convert_params(mu, alpha):
    """
    Convert mean/dispersion parameterization of a negative binomial to the ones scipy supports

    Parameters
    ----------
    mu : float
       Mean of NB distribution.
    alpha : float
       Overdispersion parameter used for variance calculation.

    See https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_formulations
    """
    var = mu + alpha * mu ** 2
    p = mu / var
    r = mu ** 2 / (var - mu)
    return r, p
# Generate sample data
n = 2
p = 0.9
sample = nbinom.rvs(n=n, p=p, size=10000)
# Estimate parameters
## Mean estimates expectation parameter for negative binomial distribution
mu = np.mean(sample)
## Dispersion parameter from NB model with only an intercept term
nbfit = smf.negativebinomial("nbdata ~ 1", data=pd.DataFrame({"nbdata": sample})).fit()
alpha = nbfit.params[1] # Dispersion parameter
# Convert parameters to n, p parameterization
n_est, p_est = convert_params(mu, alpha)
# Check that estimates are close to the true values:
print("""
{:<3} {:<3}
True parameters: {:<3} {:<3}
Estimates : {:<3} {:<3}""".format('n', 'p', n, p,
np.round(n_est, 2), np.round(p_est, 2)))

Scipy: difference between optimize.fmin and optimize.leastsq

What's the difference between scipy's optimize.fmin and optimize.leastsq? They seem to be used in pretty much the same way in this example page. The only difference I can see is that leastsq actually calculates the sum of squares on its own (as its name would suggest) while when using fmin one has to do this manually. Other than that, are the two functions equivalent?
Different algorithms underneath.
fmin uses the Nelder-Mead simplex method; leastsq wraps MINPACK's Levenberg-Marquardt least-squares algorithm.
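In practice the difference shows up mostly in the calling convention: leastsq expects a function that returns the vector of residuals, while fmin expects a single scalar cost. A small illustrative sketch (the toy model y = a*x is made up for this comparison):
import numpy as np
from scipy.optimize import fmin, leastsq

x = np.linspace(0, 1, 50)
y = 3.0 * x + 0.01 * np.random.randn(50)  # noisy data from y = 3x

def residuals(params):   # leastsq wants the residual vector
    return y - params[0] * x

def cost(params):        # fmin wants a scalar to minimize
    return np.sum(residuals(params) ** 2)

print(leastsq(residuals, [1.0])[0])    # ~[3.0]
print(fmin(cost, [1.0], disp=False))   # ~[3.0]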
Just to add some information: I am developing a module to fit a biexponential function, and leastsq turns out to be roughly 35 times faster than minimize in the timings below. Have a look at the code for more details.
I used a biexponential curve, which is a sum of two exponentials; the model function has 4 parameters to fit: S0, f, D_star and D.
All default parameters for fitting were used.
S(b) = S0 * [f * e^(-b * D_star) + (1 - f) * e^(-b * D)]
('Time taken for minimize:', 0.011617898941040039)
('Time taken for leastsq :', 0.0003180503845214844)
The code used:
import numpy as np
from scipy.optimize import minimize, leastsq
from time import time
def ivim_function(params, bvals):
    """The Intravoxel incoherent motion (IVIM) model function.

        S(b) = S_0[f*e^{(-b*D\*)} + (1-f)e^{(-b*D)}]

        S_0, f, D\* and D are the IVIM parameters.

    Parameters
    ----------
    params : array
             parameters S0, f, D_star and D of the model

    bvals : array
            bvalues

    References
    ----------
    .. [1] Le Bihan, Denis, et al. "Separation of diffusion
           and perfusion in intravoxel incoherent motion MR
           imaging." Radiology 168.2 (1988): 497-505.
    .. [2] Federau, Christian, et al. "Quantitative measurement
           of brain perfusion with intravoxel incoherent motion
           MR imaging." Radiology 265.3 (2012): 874-881.
    """
    S0, f, D_star, D = params
    S = S0 * (f * np.exp(-bvals * D_star) + (1 - f) * np.exp(-bvals * D))
    return S


def _ivim_error(params, bvals, signal):
    """Error function to be used in fitting the IVIM model"""
    return (signal - ivim_function(params, bvals))


def sum_sq(params, bvals, signal):
    """Sum of squares of the errors. This function is minimized"""
    return np.sum(_ivim_error(params, bvals, signal)**2)


x0 = np.array([100., 0.20, 0.008, 0.0009])
bvals = np.array([0., 10., 20., 30., 40., 60., 80., 100.,
                  120., 140., 160., 180., 200., 220., 240.,
                  260., 280., 300., 350., 400., 500., 600.,
                  700., 800., 900., 1000.])
data = ivim_function(x0, bvals)
optstart = time()
opt = minimize(sum_sq, x0, args=(bvals, data))
optend = time()
time_taken = optend - optstart
print("Time taken for opt:", time_taken)
lstart = time()
lst = leastsq(_ivim_error,
x0,
args=(bvals, data),)
lend = time()
time_taken = lend - lstart
print("Time taken for leastsq :", time_taken)
print('Parameters estimated using minimize :', opt.x)
print('Parameters estimated using leastsq :', lst[0])
