Tracking down the assumptions made by SciPy's `ttest_ind()` function - python

I'm trying to write my own Python code to compute t-statistics and p-values for one and two tailed independent t tests. I can use the normal approximation, but for the moment I am trying to just use the t-distribution. I've been unsuccessful in matching the results of SciPy's stats library on my test data. I could use a fresh pair of eyes to see if I'm just making a dumb mistake somewhere.
Note, this is cross-posted from Cross-Validated because it's been up for a while over there with no responses, so I thought it can't hurt to also get some software developer opinions. I'm trying to understand if there's an error in the algorithm I'm using, which should reproduce SciPy's result. This is a simple algorithm, so it's puzzling why I can't locate the mistake.
My code:
import numpy as np
import scipy.stats as st

def compute_t_stat(pop1, pop2):
    num1 = pop1.shape[0]
    num2 = pop2.shape[0]
    # The formula for t-stat when population variances differ.
    t_stat = (np.mean(pop1) - np.mean(pop2))/np.sqrt( np.var(pop1)/num1 + np.var(pop2)/num2 )
    # ADDED: The Welch-Satterthwaite degrees of freedom.
    df = ((np.var(pop1)/num1 + np.var(pop2)/num2)**(2.0))/( (np.var(pop1)/num1)**(2.0)/(num1-1) + (np.var(pop2)/num2)**(2.0)/(num2-1) )
    # Am I computing this wrong?
    # It should just come from the CDF like this, right?
    # The extra parameter is the degrees of freedom.
    one_tailed_p_value = 1.0 - st.t.cdf(t_stat,df)
    two_tailed_p_value = 1.0 - ( st.t.cdf(np.abs(t_stat),df) - st.t.cdf(-np.abs(t_stat),df) )
    # Computing with SciPy's built-ins
    # My results don't match theirs.
    t_ind, p_ind = st.ttest_ind(pop1, pop2)
    return t_stat, one_tailed_p_value, two_tailed_p_value, t_ind, p_ind
After reading a bit more on the Welch's t-test, I saw that I should be using the Welch-Satterthwaite formula to calculate degrees of freedom. I updated the code above to reflect this.
With the new degrees of freedom, I get a closer result. My two-sided p-value is off by about 0.008 from the SciPy version's... but this is still much too big an error so I must still be doing something incorrect (or SciPy distribution functions are very bad, but it's hard to believe they are only accurate to 2 decimal places).
Second update:
While continuing to try things, I thought maybe SciPy's version automatically computes the Normal approximation to the t-distribution when the degrees of freedom are high enough (roughly > 30). So I re-ran my code using the Normal distribution instead, and the computed results are actually further away from SciPy's than when I use the t-distribution.
Bonus question :)
(More statistical theory related; feel free to ignore)
Also, the t-statistic is negative. I was just wondering what this means for the one-sided t-test. Does this typically mean that I should be looking in the negative axis direction for the test? In my test data, population 1 is a control group who did not receive a certain employment training program. Population 2 did receive it, and the measured data are wage differences before/after treatment.
So I have some reason to think that the mean for population 2 will be larger. But from a statistical theory point of view, it doesn't seem right to concoct a test this way. How could I have known to check (for the one-sided test) in the negative direction without relying on subjective knowledge about the data? Or is this just one of those frequentist things that, while not philosophically rigorous, needs to be done in practice?

By using the SciPy built-in function source(), I could see a printout of the source code for the function ttest_ind(). Based on the source code, the SciPy built-in is performing the t-test assuming that the variances of the two samples are equal. It is not using the Welch-Satterthwaite degrees of freedom. SciPy assumes equal variances but does not state this assumption.
I just want to point out that, crucially, this is why you should not just trust library functions. In my case, I actually do need the t-test for populations of unequal variances, and the degrees of freedom adjustment might matter for some of the smaller data sets I will run this on.
As I mentioned in some comments, the discrepancy between my code and SciPy's is about 0.008 for sample sizes between 30 and 400, and then slowly goes to zero for larger sample sizes. This is an effect of the extra (1/n1 + 1/n2) term in the equal-variances t-statistic denominator. Accuracy-wise, this is pretty important, especially for small sample sizes. It definitely confirms to me that I need to write my own function. (Possibly there are other, better Python libraries, but this at least should be known. Frankly, it's surprising this isn't anywhere up front and center in the SciPy documentation for ttest_ind()).
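For anyone reading this with a newer SciPy: ttest_ind accepts an equal_var argument, and passing equal_var=False requests Welch's test. Below is a minimal sketch that compares a hand-written Welch test against it, assuming a SciPy version that has that argument. The function name welch_t_test and the synthetic control/treated arrays are made up for illustration and are not the data from this question.

```python
import numpy as np
import scipy.stats as st

def welch_t_test(a, b):
    """Welch's t-test computed by hand: returns (t, df, two-sided p)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard errors, using the unbiased sample variance (ddof=1).
    se2_a = np.var(a, ddof=1) / a.size
    se2_b = np.var(b, ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1))
    p_two_sided = 2.0 * st.t.sf(abs(t_stat), df)
    return t_stat, df, p_two_sided

# Synthetic data with deliberately unequal variances and sizes (illustrative only).
rng = np.random.RandomState(0)
control = rng.normal(loc=0.0, scale=1.0, size=40)
treated = rng.normal(loc=0.5, scale=3.0, size=25)

t_manual, df_manual, p_manual = welch_t_test(control, treated)
t_scipy, p_scipy = st.ttest_ind(control, treated, equal_var=False)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

With equal_var left at its default, the same call falls back to the pooled-variance test, which is the behaviour described above.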

You are not calculating the sample variance, but instead you are using population variances. Sample variance divides by n-1, instead of n. np.var has an optional argument called ddof for reasons similar to this.
This should give you your expected result:
import numpy as np
import scipy.stats as st

def compute_t_stat(pop1, pop2):
    num1 = pop1.shape[0]
    num2 = pop2.shape[0]
    var1 = np.var(pop1, ddof=1)
    var2 = np.var(pop2, ddof=1)
    # The formula for t-stat when population variances differ.
    t_stat = (np.mean(pop1) - np.mean(pop2)) / np.sqrt(var1/num1 + var2/num2)
    # ADDED: The Welch-Satterthwaite degrees of freedom.
    df = ((var1/num1 + var2/num2)**(2.0))/((var1/num1)**(2.0)/(num1-1) + (var2/num2)**(2.0)/(num2-1))
    # Am I computing this wrong?
    # It should just come from the CDF like this, right?
    # The extra parameter is the degrees of freedom.
    one_tailed_p_value = 1.0 - st.t.cdf(t_stat,df)
    two_tailed_p_value = 1.0 - ( st.t.cdf(np.abs(t_stat),df) - st.t.cdf(-np.abs(t_stat),df) )
    # Computing with SciPy's built-ins
    # My results don't match theirs.
    t_ind, p_ind = st.ttest_ind(pop1, pop2)
    return t_stat, one_tailed_p_value, two_tailed_p_value, t_ind, p_ind
PS: SciPy is open source and mostly implemented in Python. You could have checked the source code of ttest_ind and found your mistake yourself.
For the bonus side: You don't decide on the side of the one-tail test by looking at your t-value. You decide it beforehand with your hypothesis. If your null hypothesis is that the means are equal and your alternative hypothesis is that the second mean is larger, then your tail should be on the left (negative) side. Because sufficiently small (negative) values of your t-value would indicate that the alternative hypothesis is more likely to be true instead of the null hypothesis.
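To make the direction concrete: with t computed as mean(pop1) - mean(pop2), the alternative "the mean of pop2 is larger" puts the evidence in the left tail, so the one-sided p-value is the CDF at t rather than 1 - CDF. A small sketch follows; the t_stat and df values are hypothetical numbers chosen for illustration, not results from the question's data.

```python
import scipy.stats as st

t_stat = -2.1   # hypothetical t-statistic, computed as mean(pop1) - mean(pop2)
df = 57.3       # hypothetical Welch-Satterthwaite degrees of freedom

# H1: mean(pop2) > mean(pop1)  ->  evidence lives in the left (negative) tail.
p_left = st.t.cdf(t_stat, df)
# H1: mean(pop1) > mean(pop2)  ->  evidence lives in the right (positive) tail.
p_right = st.t.sf(t_stat, df)   # survival function, i.e. 1 - cdf
# H1: the means differ in either direction.
p_two_sided = 2.0 * st.t.sf(abs(t_stat), df)

print(p_left, p_right, p_two_sided)
```

The question's 1.0 - st.t.cdf(t_stat, df) corresponds to p_right here, which for a negative t comes out close to 1.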

Looks like you forgot the **2 on the numerator of your df:
# The Welch-Satterthwaite degrees of freedom.
df = (np.var(pop1)/num1 + np.var(pop2)/num2)/( (np.var(pop1)/num1)**(2.0)/(num1-1) + (np.var(pop2)/num2)**(2.0)/(num2-1) )
should be:
df = (np.var(pop1)/num1 + np.var(pop2)/num2)**2/( (np.var(pop1)/num1)**(2.0)/(num1-1) + (np.var(pop2)/num2)**(2.0)/(num2-1) )
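For reference, this is the quantity that line computes, written out in the standard Welch-Satterthwaite form. Here s_1^2 and s_2^2 stand for the two sample variances and n_1, n_2 for the sample sizes (notation chosen here, not taken from the posts):

```latex
\nu \approx
\frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}
     {\dfrac{\left(s_1^2/n_1\right)^{2}}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^{2}}{n_2 - 1}}
```

Without the square on the numerator the units do not cancel, which is a quick way to spot the slip.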


How can I generate numbers in a set range but skewed towards a specific point? [duplicate]

I would like to implement a function in python (using numpy) that takes a mathematical function (for ex. p(x) = e^(-x) like below) as input and generates random numbers, that are distributed according to that mathematical-function's probability distribution. And I need to plot them, so we can see the distribution.
What I actually need is a random number generator for exactly the following two mathematical functions, but if it could take other functions as input, why not:
1) p(x) = e^(-x)
2) g(x) = (1/sqrt(2*pi)) * e^(-(x^2)/2)
Does anyone have any idea how this is doable in python?

For simple distributions like the ones you need, or if the CDF is easy to invert in closed form, you can find plenty of samplers in NumPy, as correctly pointed out in Olivier's answer.
For arbitrary distributions you could use Markov chain Monte Carlo sampling methods.
The simplest and perhaps easiest to understand variant of these algorithms is Metropolis sampling.
The basic idea goes like this:
start from a random point x and take a random step xnew = x + delta
evaluate the desired probability distribution at the starting point, p(x), and at the new one, p(xnew)
if the new point is more probable, i.e. p(xnew)/p(x) >= 1, accept the move
if the new point is less probable, randomly decide whether to accept or reject it depending on how probable [1] the new point is, i.e. accept with probability p(xnew)/p(x)
take a new step from this point and repeat the cycle
It can be shown, see e.g. Sokal [2], that points sampled with this method follow the desired probability distribution.
An extensive implementation of Monte Carlo methods in Python can be found in the PyMC3 package.
Example implementation
Here's a toy example just to show you the basic idea, not meant in any way as a reference implementation. Please refer to mature packages for any serious work.
import numpy as np

def uniform_proposal(x, delta=2.0):
    return np.random.uniform(x - delta, x + delta)

def metropolis_sampler(p, nsamples, proposal=uniform_proposal):
    x = 1 # start somewhere
    for i in range(nsamples):
        trial = proposal(x) # random neighbour from the proposal distribution
        acceptance = p(trial)/p(x)
        # accept the move conditionally
        if np.random.uniform() < acceptance:
            x = trial
        yield x
Let's see if it works with some simple distributions
Gaussian mixture
def gaussian(x, mu, sigma):
    return 1./sigma/np.sqrt(2*np.pi)*np.exp(-((x-mu)**2)/2./sigma/sigma)

p = lambda x: gaussian(x, 1, 0.3) + gaussian(x, -1, 0.1) + gaussian(x, 3, 0.2)
samples = list(metropolis_sampler(p, 100000))
Cauchy
def cauchy(x, mu, gamma):
    return 1./(np.pi*gamma*(1.+((x-mu)/gamma)**2))

p = lambda x: cauchy(x, -2, 0.5)
samples = list(metropolis_sampler(p, 100000))
Arbitrary functions
You don't really have to sample from proper probability distributions. You might just have to enforce a limited domain in which to sample your random steps [3]:
p = lambda x: np.sqrt(x)
samples = list(metropolis_sampler(p, 100000, domain=(0, 10)))
p = lambda x: (np.sin(x)/x)**2
samples = list(metropolis_sampler(p, 100000, domain=(-4*np.pi, 4*np.pi)))
There is still way too much to say, about proposal distributions, convergence, correlation, efficiency, applications, Bayesian formalism, other MCMC samplers, etc.
I don't think this is the proper place and there is plenty of much better material than what I could write here available online.
[1] The idea here is to favor exploration where the probability is higher but still look at low-probability regions, as they might lead to other peaks. The choice of the proposal distribution, i.e. how you pick new points to explore, is fundamental. Steps that are too small might confine you to a limited area of your distribution; steps that are too big could lead to a very inefficient exploration.
[2] Physics oriented. Bayesian formalism (Metropolis-Hastings) is preferred these days but IMHO it's a little harder to grasp for beginners. There are plenty of tutorials available online, see e.g. this one from Duke University.
[3] Implementation not shown so as not to add too much confusion, but it's straightforward: you just have to wrap trial steps at the domain edges or make the desired function go to zero outside the domain.
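For completeness, here is one way the domain restriction could look, using the "function goes to zero outside the domain" option. This is a rough sketch, not a reference implementation. The domain, delta, x0 and seed parameters are assumptions of this sketch (only domain appears in the calls above), it uses the standard library's random module instead of NumPy so it runs on its own, and it assumes the starting point x0 lies inside the domain with p(x0) > 0.

```python
import math
import random

def metropolis_sampler(p, nsamples, domain=None, delta=2.0, x0=1.0, seed=None):
    """Toy Metropolis sampler; `domain=(lo, hi)` restricts where samples may land.

    Proposals outside the domain are treated as having zero probability,
    so they are always rejected and the chain stays where it is.
    """
    rng = random.Random(seed)
    x = x0
    for _ in range(nsamples):
        trial = rng.uniform(x - delta, x + delta)   # symmetric uniform proposal
        inside = domain is None or domain[0] <= trial <= domain[1]
        # Zero density outside the domain means the move can never be accepted;
        # p(trial) is not even evaluated there.
        acceptance = p(trial) / p(x) if inside else 0.0
        if rng.random() < acceptance:
            x = trial
        yield x

# Unnormalised target sqrt(x) on (0, 10), as in the example above.
samples = list(metropolis_sampler(math.sqrt, 100000, domain=(0, 10), seed=0))
print(min(samples), max(samples), sum(samples) / len(samples))
```

For a density proportional to sqrt(x) on (0, 10) the exact mean is 6, so the printed sample mean gives a quick sanity check on the chain.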

NumPy offers a wide range of probability distributions.
The first function is an exponential distribution with parameter 1.
np.random.exponential(1)
The second one is a normal distribution with mean 0 and variance 1.
np.random.normal(0, 1)
Note that in both cases, the arguments are optional as these are the default values for these distributions.
As a side note, you can also find those distributions in the random module as random.expovariate and random.gauss respectively.
More general distributions
While NumPy will likely cover all your needs, remember that you can always compute the inverse cumulative distribution function of your distribution and feed it values drawn from a uniform distribution.
For example, if NumPy did not provide the exponential distribution, you could do this.
def exponential():
    # Inverse of the CDF F(x) = 1 - exp(-x), applied to a uniform draw in [0, 1)
    return -np.log(1.0 - np.random.uniform())
If you encounter distributions whose CDF is not easy to compute, then consider filippo's great answer.
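As a quick check of the inverse-CDF idea, here is a self-contained sketch with a rate parameter added. The names sample_exponential and rate are made up for this example, and it uses the standard library's random module rather than NumPy so that it has no dependencies.

```python
import math
import random

def sample_exponential(rng, rate=1.0):
    """Draw one sample from p(x) = rate * exp(-rate * x) by inverting its CDF."""
    u = rng.random()                    # uniform on [0, 1)
    # CDF: F(x) = 1 - exp(-rate * x)  =>  x = -ln(1 - u) / rate
    return -math.log(1.0 - u) / rate

rng = random.Random(42)
draws = [sample_exponential(rng) for _ in range(100000)]
mean = sum(draws) / len(draws)
print(mean)   # should land close to 1/rate
```

Using 1 - u rather than u keeps the argument of the logarithm in (0, 1], so log(0) cannot occur.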

Question regarding differences in calculating T statistics in Python for the difference in means

I am re-learning introductory statistics and wanted to try implementing my own versions of the general and unpooled formulas that find the T Value. I implemented it in 2 ways, one by just replicating the formulas as is as Python Functions. The other was to use Python's ability to generate a normal distribution and use that to find the difference in means. But I noticed my values were pretty different in both versions. So my question is why is there a difference? Is it with how the function works itself?
Here's the "generate a distribution itself" method:
from numpy.random import seed
from numpy.random import normal
from scipy import stats
from datetime import datetime
import math
#Plan: Generate 2 random normal samples with the desired criteria, and t-test them
data1 = normal(loc=65.2, scale=7.8, size=30)
data2 = normal(loc=70.3, scale=8.4, size=30)
stats.ttest_ind(a=data1, b=data2)
Ttest_indResult(statistic=-2.029830829733737, pvalue=0.04696953433513939)
As you can see, it gives a T statistic of ~-2.0298 and a p value of ~ 0.0470.
Here's my "manual version":
def pop_2_mean_pooled_t(mean1, mean2, s1, s2, n1, n2):
    dof = (n1+n2)-2
    mean_diff = mean1 - mean2
    #The N part on the right
    right_n = math.sqrt((1/n1) + (1/n2))
    #The Sp part
    sp_numereator_left = ((n1-1)*(s1**2))
    sp_numberator_right = ((n2-1)*(s2**2))
    sp = math.sqrt((sp_numereator_left + sp_numberator_right)/(dof))
    pooled_sp = sp*right_n
    t = mean_diff/pooled_sp
    p = stats.t.cdf(t, dof)
    print("T is " +str(t))
    print("p is " +str(p))
    return t, p

pop_2_mean_pooled_t(65.2, 70.3, 7.8, 8.4, 30, 30)
T is -2.4368742610942298
p is 0.00895208222413155
(-2.4368742610942298, 0.00895208222413155)
As you can see, it gives a T statistic of ~-2.437 and a p value of ~ 0.009.
My question is why is there a discrepancy here? My "manual version" is closer to the example I was referencing. But surely the generator one should also be?
My understanding is that if a sample is sufficiently large, it would resemble a normal distribution. Therefore, one could generate a normal distribution using code and use that to approximate the corresponding T values. For some reason, that differed quite a bit from my "manual" version.

Your thinking is basically correct (I did not check your formulae though). What you're encountering is in the nature of the problem: the two random samples you're drawing are, well, random, and they differ in subsequent runs, so you will always get a different p-value and t-statistic.
Two suggestions from me:
increase the sample size in the first snippet to hundreds (not 30): you should get much closer to the stats from the second snippet.
keep 30 samples in the first snippet but run the simulation several times; you will learn the distributions of p-values and t-statistics and, again, you can check the values from your second snippet against the simulated distributions.
(Some conceptual flaws occur in this approach, e.g. repeated testing affects the p-value, but let us put them aside for now; the goal is to see your two sets of values converge.)
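Here is a sketch of both suggestions in code, with one extra observation. stats.t.cdf(t, dof) in the manual version is a one-tailed p-value, whereas ttest_ind reports a two-sided one, so the two p-values would differ by a factor of two even for identical t. SciPy also has ttest_ind_from_stats, which runs the pooled test directly on summary statistics. The seed and the number of repetitions below are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

# Test computed directly from the summary statistics (pooled variance by default).
t_summary, p_summary = stats.ttest_ind_from_stats(
    mean1=65.2, std1=7.8, nobs1=30,
    mean2=70.3, std2=8.4, nobs2=30,
)
print(t_summary, p_summary)   # p_summary is two-sided

# Repeat the "generate two samples of 30" experiment many times and
# look at how much the t-statistic moves from run to run.
rng = np.random.RandomState(0)
t_values = np.array([
    stats.ttest_ind(rng.normal(65.2, 7.8, size=30),
                    rng.normal(70.3, 8.4, size=30)).statistic
    for _ in range(2000)
])
print(t_values.mean(), t_values.std())
```

The simulated t-statistics scatter around the summary-statistics value with a spread of roughly one, which is why a single run at n=30 can easily land at -2.03.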

is seaborn confidence interval computed correctly?

First, I must admit that my statistics knowledge is rusty at best: even when it was shining new, it's not a discipline I particularly liked, which means I had a hard time making sense of it.
Nevertheless, I took a look at how the barplot graphs were calculating error bars, and was surprised to find a "confidence interval" (CI) used instead of (the more common) standard deviation. Researching CIs further led me to this Wikipedia article, which seems to say that, basically, a CI is computed as:
mean ± 1.96 * std / sqrt(n)
Or, in pseudocode:
def ci_wp(a):
    """calculate confidence interval using Wikipedia's formula"""
    m = np.mean(a)
    s = 1.96*np.std(a)/np.sqrt(len(a))
    return m - s, m + s
But what we find in seaborn/ is:
def ci(a, which=95, axis=None):
    """Return a percentile range from an array of values."""
    p = 50 - which / 2, 50 + which / 2
    return percentiles(a, p, axis)
Now maybe I'm missing this completely, but this seems just like a completely different calculation than the one proposed by Wikipedia. Can anyone explain this discrepancy?
To give another example, from comments, why do we get such different results between:
>>> ci(np.arange(100))
array([ 2.475, 96.525])
>>> ci_wp(np.arange(100))
And to compare with other statistical tools:
def ci_std(a):
    """calculate margin of error using standard deviation"""
    m = np.mean(a)
    s = np.std(a)
    return m-s, m+s

def ci_sem(a):
    """calculate margin of error using standard error of the mean"""
    m = np.mean(a)
    s = sp.stats.sem(a)
    return m-s, m+s
Which gives us:
>>> ci_sem(np.arange(100))
(46.598850802411796, 52.401149197588204)
>>> ci_std(np.arange(100))
(20.633929952277882, 78.366070047722118)
Or with a random sample:
rng = np.random.RandomState(10)
a = rng.normal(size=100)
print(ci(a))
print(ci_wp(a))
print(ci_sem(a))
print(ci_std(a))
... which yields:
[-1.9667006 2.19502303]
(-0.1101230745774124, 0.26895640045116026)
(-0.017774461397903049, 0.17660778727165088)
(-0.88762281417683186, 1.0464561400505796)
Why are Seaborn's numbers so radically different from the other results?

Your calculation using this Wikipedia formula is completely right. Seaborn just uses another method, bootstrapping, which is well described by Dragicevic [1]:
[It] consists of generating many alternative datasets from the experimental data by randomly drawing observations with replacement. The variability across these datasets is assumed to approximate sampling error and is used to compute so-called bootstrap confidence intervals. [...] It is very versatile and works for many kinds of distributions.
In Seaborn's source code, a barplot uses estimate_statistic, which bootstraps the data and then computes the confidence interval on it:
array([43.91, 55.21025])
The result is consistent with your calculation.
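A rough sketch of the percentile-bootstrap idea, for illustration only. This is not seaborn's implementation; bootstrap_ci and its parameters (which, n_boot, func, seed) are names made up here.

```python
import numpy as np

def bootstrap_ci(a, which=95, n_boot=1000, func=np.mean, seed=None):
    """Percentile bootstrap confidence interval for func(a)."""
    a = np.asarray(a)
    rng = np.random.RandomState(seed)
    # Resample with replacement and recompute the statistic each time.
    boot_stats = np.array([
        func(rng.choice(a, size=a.size, replace=True)) for _ in range(n_boot)
    ])
    # The CI is the central `which` percent of the bootstrap distribution.
    lo, hi = np.percentile(boot_stats, [50 - which / 2.0, 50 + which / 2.0])
    return lo, hi

print(bootstrap_ci(np.arange(100), seed=0))
```

The key point is that the percentiles are taken over the bootstrapped means, not over the raw data, which is why calling ci() directly on the raw array gives a much wider interval.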
[1] Dragicevic, P. (2016). Fair statistical communication in HCI. In Modern Statistical Methods for HCI (pp. 291-330). Springer, Cham.
You need to check the code of percentiles. The seaborn ci code you posted simply computes the percentile limits. This interval has a defined mean of 50 (median) and a default range of 95% confidence interval. The actual mean, the standard deviation, etc. will appear in the percentiles routine.

equation system with fsolve

I am trying to find a solution for a system of equations by using scipy.optimize.fsolve in Python 2.7. The goal is to calculate equilibrium concentrations for a chemical system. Due to the nature of the problem, some of the constants are very small. For some combinations of parameters I do get a proper solution; for others I don't. Either the solutions are negative, which is not reasonable from a physical point of view, or fsolve produces:
ier = 3, 'xtol=0.000000 is too small, no further improvement in the approximate\n solution is possible.')
ier = 4, 'The iteration is not making good progress, as measured by the \n improvement from the last five Jacobian evaluations.')
ier = 5, 'The iteration is not making good progress, as measured by the \n improvement from the last ten iterations.')
It seems to me, based on my research, that the failure to find proper solutions of the equation system is connected to the datatype float64 not being precise enough. As a friend pointed out, the system is not well conditioned, with parameters differing by several orders of magnitude.
So I tried to use fsolve with the mpfr type provided by the gmpy2 module, but that resulted in the following error:
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
Now here is a small example with parameters which lead to a solution if the randomized starting values happen to be good. However, if the constant C_HCL is chosen to be something like 1e-4 or bigger, then I never find a proper solution.
from numpy import *
from scipy.optimize import *

K_1 = 1e-8
K_2 = 1e-8
K_W = 1e-30
C_HCL = 1e-11
C_HL = 1e-6

if C_HCL-C_NAOH > 0:
    Saeure_Base = C_HCL-C_NAOH+sqrt(K_W)
    OH_init = K_W/(Saeure_Base)
elif C_HCL-C_NAOH < 0:
    OH_init = C_NAOH-C_HCL+sqrt(K_W)
    Saeure_Base = K_W/OH_init

# some randomized start parameters
G1 = random.uniform(0, 2)*Saeure_Base
G2 = random.uniform(0, 2)*OH_init
G3 = random.uniform(1, 2)*C_HL*(sqrt(K_W))/(Saeure_Base+OH_init)
G4 = random.uniform(0.1, 1)*(C_HL - G3)/2
G5 = C_HL - G3 - G4
zGuess = array([G1,G2,G3,G4,G5])

#equation system / 5 variables --> H3O, OH, HL, H2L, L
def myFunction(z):
    H3O = z[0]
    OH = z[1]
    HL = z[2]
    H2L = z[3]
    L = z[4]
    F = empty((5))
    F[0] = H3O*L/HL - K_1
    F[1] = OH*H2L/HL - K_2
    F[2] = K_W - OH*H3O
    F[3] = C_HL - HL - H2L - L
    return F

z = fsolve(myFunction,zGuess, maxfev=10000, xtol=1e-15, full_output=1,factor=0.1)
print(z)
So the questions are: Is this problem caused by the precision of float64 and, if yes, (how) can it be solved with Python? Is fsolve the way to go? Would I need to change the fsolve function so it accepts a different data type?

The root of your problem is either theoretical or numerical.
The scipy.optimize.fsolve function is based on the MINPACK Fortran solver. This solver uses a Newton-Raphson-type algorithm (MINPACK's hybrd and hybrj routines, which implement a modified Powell hybrid method) to provide the solution.
There are underlying assumptions about the smoothness of the function when you use this algorithm. For example, the jacobian matrix at the solution point x is supposed to be invertible. The one you are more concerned about is the basins of attraction.
In order to converge, the starting point of the algorithm needs to be near the actual solution, i.e. in its basin of attraction. This condition is always met for convex functions, but it is easy to find functions for which this algorithm behaves badly. Your function is one of these, as it contains ratios of the unknowns.
To address this issue you should just change the starting point. This starting point becomes also very important for functions with multiple solutions: this picture from the wikipedia article shows you the solution found depending of the starting point (five colours for five solutions); so you should be careful with your solution and actually check the "physical" aspects of your solution.
For the numerical aspects, the Newton-Raphson algorithm needs the value of the Jacobian matrix (the matrix of derivatives). If it is not provided to the MINPACK solver, the Jacobian is estimated with a finite-difference formula. The perturbation step for the finite-difference formula is set with epsfcn; with the default epsfcn=None the step is derived from machine precision, and epsfcn is not used at all when fprime is provided (there is no need for the Jacobian estimation in that case). So first you should set one of the two. You could also specify the Jacobian directly by differentiating your function by hand.
However, the minimum value for the step size will be the machine precision, also called machine epsilon. For your problem, you have very small input values, which can be a problem. I would suggest multiplying every one of them by the same value (like 10^6); this is equivalent to a change of units but will avoid rounding errors and problems with machine precision.
This problem also matters for the parameter xtol=1e-15 you provided. In your error message it is printed as xtol=0.000000; at 1e-15 it is so close to machine precision (about 2.2e-16 for float64) that the solver cannot meet it. Also, if you look at your line F[2] = K_W - OH*H3O, given the machine precision, it hardly matters whether K_W is 1e-15 or 1e-30: compared to the machine precision, 0 is practically a solution in both cases. To avoid this problem, just multiply everything by a bigger value.
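To put numbers on "machine precision", here is a small illustration using only the standard library. The values 1e-8 and 1e-15 are placeholders of the same order as the constants in the question, not an attempt to reproduce its system.

```python
import sys

eps = sys.float_info.epsilon       # spacing between 1.0 and the next float
print(eps)                         # 2.220446049250313e-16

# Relative to a quantity of order 1, anything well below eps is invisible...
print(1.0 + 1e-17 == 1.0)          # True
# ...but a tiny number on its own is still stored with full relative precision.
print(1e-30 == 0.0)                # False

# A residual built from tiny products is itself tiny in absolute terms,
# so it looks like "zero" to any tolerance tuned for order-1 quantities.
residual = 1e-15 - 1e-8 * 1e-8     # K_W-like constant minus a product of concentrations
print(residual)
```

So the trouble is less that float64 cannot store 1e-30 and more that residuals of that size are negligible in absolute terms, which is what rescaling the unknowns addresses.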
So to sum up:
For the Newton-Raphson algorithm, the initialisation point matters!
For this algorithm, you should specify how you compute the Jacobian!
In numerical computation, avoid working with very small values. You can easily change the units to something more convenient: it is a basic unit conversion, like working in grams instead of kilograms.
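The last two points can be sketched on a toy problem. This is a hypothetical two-equation system, not the chemical system from the question: solve x*y = 1e-14 and x - y = 1e-7 by working in units of 1e-7 and handing fsolve an analytic Jacobian. The names SCALE, residuals_scaled and jacobian_scaled are made up for this example.

```python
import numpy as np
from scipy.optimize import fsolve

SCALE = 1e7   # work in units of 1e-7 instead of 1 (illustrative choice)

def residuals_scaled(z):
    """Toy system in scaled units: u*v = 1 and u - v = 1."""
    u, v = z
    return [u * v - 1.0,      # x*y = 1e-14 after multiplying both x and y by 1e7
            u - v - 1.0]      # x - y = 1e-7 after scaling

def jacobian_scaled(z):
    """Analytic Jacobian of residuals_scaled (rows = equations, columns = unknowns)."""
    u, v = z
    return [[v, u],
            [1.0, -1.0]]

solution, info, ier, msg = fsolve(residuals_scaled, [1.0, 1.0],
                                  fprime=jacobian_scaled, full_output=True)
x, y = np.asarray(solution) / SCALE     # convert back to the original units
print(ier, x, y)
```

In scaled units all quantities are of order 1, so the default tolerances are meaningful. The toy system also has a second, negative root, which illustrates why the starting point and a physical sanity check still matter.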

How to do calibration accounting for resolution of the instrument

I have to calibrate a distance measuring instrument which gives capacitance as output. I am able to use numpy polyfit to find a relation and apply it to get distance, but I need to include the limit of detection of 0.0008 m, as it is the resolution of the instrument.
My data is:
cal_distance = [.1 , .4 , 1, 1.5, 2, 3]
cal_capacitance = [1971, 2336, 3083, 3720, 4335, 5604]
raw_data = [3044,3040,3039,3036,3033]
I need my distance values to be like .1008, .4008 that represents the limits of detection of the instrument.
I have used the following code:
import numpy as np

coeffs = np.polyfit(cal_capacitance, cal_distance, 1)
new_distance = []
for i in raw_data:
    d = i*coeffs[0] + coeffs[1]
    new_distance.append(d)  # keep each converted reading
I have a csv file and actually used a pandas dataframe with date time index to store the raw data, but for simplicity I have given a list here.
I need to include the limits of detection in the calibration process to get it right.
Limit of detection is the accuracy of your measurement (the smallest 'step' you can resolve)
polyfit gives you a 'model' of the best fit function f of the relation
distance = f(capacitance)
You use 1 as the degree of the polynomial so you're basically fitting a line.
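So, first off you need to look into the accuracy of the fit: this is returned by passing full=True.
(see the docs for more details)
You will get the residual of the fit, which polyfit reports as the sum of squared residuals.
Is the typical fit error actually smaller than the LOD? Otherwise your limiting factor is the fitting
accuracy. In your particular case the residual is 0.00017021. That is a sum of squares, so convert it to a typical error per calibration point before comparing: sqrt(0.00017021/6) is about 0.005 m, which is above the 0.0008 LOD, so the fit itself limits the accuracy more than the LOD does.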
Second, why 'add' the LOD to the reading? Your reading is the reading; the LOD is then the +/- range within which the true distance could lie. Adding it to the end result does not seem to make sense here.
You should instead report the final value as 'new distance' +/- LOD.
Is your raw data all measurements of the same distance? If so, you can see that the standard deviation of this measurement using the fit is 0.0029680362423331122, ( numpy.std(new_distance) ) and range is 0.0087759439302268483, which is 10x over the LOD, so here your limiting factor really seems to be the measuring conditions.
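Putting those checks together with the question's data, here is a minimal sketch. The variable names rms_fit_error, spread and value_range are made up here, and the LOD is taken as 0.0008 m as stated in the question.

```python
import numpy as np

cal_distance = [.1, .4, 1, 1.5, 2, 3]
cal_capacitance = [1971, 2336, 3083, 3720, 4335, 5604]
raw_data = [3044, 3040, 3039, 3036, 3033]
LOD = 0.0008  # metres, as stated in the question

# full=True also returns the sum of squared residuals of the fit.
coeffs, residuals, rank, sv, rcond = np.polyfit(
    cal_capacitance, cal_distance, 1, full=True)
# Typical (root-mean-square) error of the fit per calibration point.
rms_fit_error = np.sqrt(residuals[0] / len(cal_distance))

# Convert the raw capacitance readings to distances with the fitted line.
new_distance = np.polyval(coeffs, raw_data)
spread = np.std(new_distance)
value_range = new_distance.max() - new_distance.min()

print("sum of squared residuals:", residuals[0])
print("RMS fit error:", rms_fit_error)
print("std of readings:", spread, "range:", value_range)
for d in new_distance:
    print("%.4f +/- %.4f m" % (d, LOD))
```

The last loop reports each reading as value +/- LOD rather than adding the LOD to the value. If the RMS fit error comes out above the LOD, a higher-degree polynomial may be worth trying.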
Not to beat a dead horse, but LOD and precision are two completely different things. LOD is typically defined as three-times the standard deviation of the noise of your instrument, which would be equivalent to the minimum capacitance (or distance , which is related to capacitance here) your instrument can detect. i.e. anything less than that is equivalent to zero (more or less). But your precision is the minimum change in capacitance that can be detected by your instrument, which may or may not be less than the LOD. Such terms (in addition to accuracy) are common sources of confusion. While you may know what you are talking about when you say LOD (and everyone else may be able to understand that you really mean precision) it would be beneficial to use the proper notation. Just a thought...

