Log Normal Random Variables with Scipy - python

I fail to understand the very basics of creating lognormal variables as documented here.
The log normal distribution takes on mean and variance as parameters. I would like to create a frozen distribution using these parameters and then get cdf, pdf etc.
However, in the documentation, they get the frozen distribution using
from scipy.stats import lognorm
s = 0.953682269606
rv = lognorm(s)
's' seems to be the standard deviation. I tried to use the 'loc' and 'scale' parameters instead of 's', but that generated an error (s is a required parameter). How can I generate a frozen distribution with parameter values 'm', 's' for location and scale?

The mystery is solved (edit 3)
μ corresponds to ln(scale) (!)
σ corresponds to shape (s)
loc is not needed for setting any of σ and μ
I think it is a severe problem that this is not clearly documented. I guess many have fallen for this when doing simple tests with the lognormal distribution in SciPy.
Why is that?
The stats module treats loc and scale the same for all distributions (this is not explicitly written down, but can be inferred when reading between the lines). My suspicion was that loc is subtracted from x, and the result is divided by scale (and the result is treated as the new x). I tested for that, and this turned out to be the case.
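The standardisation described above can be checked numerically. This is a minimal sketch, assuming only that NumPy and SciPy are installed; the values of s, loc, scale and x are arbitrary choices for illustration, not taken from the question.

```python
import numpy as np
from scipy.stats import lognorm

s, loc, scale = 0.5, 1.0, 3.0          # arbitrary illustrative values
x = np.array([1.5, 2.0, 4.0, 10.0])    # evaluation points above loc

# pdf evaluated with explicit loc and scale
direct = lognorm.pdf(x, s, loc=loc, scale=scale)

# pdf of the standardised variable y = (x - loc) / scale,
# divided by scale (the Jacobian of the change of variable)
y = (x - loc) / scale
standardised = lognorm.pdf(y, s) / scale

print(np.allclose(direct, standardised))
```

If the two arrays agree, loc and scale enter only through (x - loc) / scale, which is the behavior inferred in the text.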
What does this mean for the lognormal distribution? In the canonical definition of the lognormal distribution the term ln(x) appears. Obviously, the same term appears in SciPy's implementation. Given the considerations above, this is how loc and scale end up in the logarithm:
ln((x-loc)/scale)
By the usual rules for logarithms, this is the same as
ln(x-loc) - ln(scale)
In the canonical definition of the lognormal distribution the term simply is ln(x) - μ. Comparing SciPy's approach and the canonical approach then provides the crucial insight: ln(scale) represents μ. loc, however, has no correspondence in the canonical definition and is better left at 0. Further below, I have argued for the fact that shape (s) is σ.
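The correspondence between ln(scale) and μ can also be checked directly on the density. The sketch below compares SciPy's pdf with a hand-written textbook density; the function name canonical_lognormal_pdf and the test points are my own, chosen only for illustration.

```python
import numpy as np
from scipy.stats import lognorm

def canonical_lognormal_pdf(x, mu, sigma):
    """Textbook lognormal density with parameters mu and sigma (illustrative helper)."""
    return (np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2))
            / (x * sigma * np.sqrt(2 * np.pi)))

mu, sigma = 2.0, 2.0
x = np.array([0.1, 1.0, 5.0, 50.0])

# shape = sigma, loc = 0, scale = exp(mu)
scipy_pdf = lognorm.pdf(x, sigma, loc=0, scale=np.exp(mu))

print(np.allclose(scipy_pdf, canonical_lognormal_pdf(x, mu, sigma)))
```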
Proof
>>> import math
>>> from scipy.stats import lognorm
>>> mu = 2
>>> sigma = 2
>>> l = lognorm(s=sigma, loc=0, scale=math.exp(mu))
>>> print("mean: %.5f stddev: %.5f" % (l.mean(), l.std()))
mean: 54.59815 stddev: 399.71719
I use WolframAlpha as a reference. It provides analytically determined values for the mean and standard deviation of the lognormal distribution.
http://www.wolframalpha.com/input/?i=log-normal+distribution%2C+mean%3D2%2C+sd%3D2
The values match.
WolframAlpha as well as SciPy come up with the mean and standard deviation by evaluating analytical terms. Let's perform an empirical test, by taking many samples from the SciPy distribution, and calculate their mean and standard deviation "manually" (from the whole set of samples):
>>> import numpy as np
>>> samples = l.rvs(size=2*10**7)
>>> print("mean: %.5f stddev: %.5f" % (np.mean(samples), np.std(samples)))
mean: 54.52148 stddev: 380.14457
This is still not perfectly converged, but I think it is proof enough that the samples correspond to the same distribution that WolframAlpha assumed, given μ=2 and σ=2.
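The question speaks of "mean and variance" as parameters. If that is meant as the mean and variance of the lognormal variable itself, rather than μ and σ of the underlying normal, they have to be converted first. Here is a small sketch of that conversion using the standard moment formulas; the helper name lognorm_from_moments is hypothetical and not part of SciPy.

```python
import numpy as np
from scipy.stats import lognorm

def lognorm_from_moments(mean, variance):
    """Frozen lognormal whose own mean and variance match the arguments.

    Hypothetical helper, not a SciPy function.
    """
    sigma2 = np.log(1.0 + variance / mean ** 2)   # sigma^2 of the underlying normal
    mu = np.log(mean) - sigma2 / 2.0              # mu of the underlying normal
    return lognorm(s=np.sqrt(sigma2), loc=0, scale=np.exp(mu))

rv = lognorm_from_moments(mean=10.0, variance=4.0)
print(rv.mean(), rv.var())
```

The frozen object rv then offers pdf, cdf, ppf and rvs as usual.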
And another small edit: it looks like proper usage of a search engine would have helped, we were not the first to be trapped by this:
https://stats.stackexchange.com/questions/33036/fitting-log-normal-distribution-in-r-vs-scipy
http://nbviewer.ipython.org/url/xweb.geos.ed.ac.uk/~jsteven5/blog/lognormal_distributions.ipynb
scipy, lognormal distribution - parameters
Another edit: now that I know how it behaves, I realize that the behavior is in principle documented. In the "notes" section we can read:
with shape parameter sigma and scale parameter exp(mu)
It is just really not obvious (we both were not able to appreciate the importance of this small sentence). I guess the reason that we could not understand what this sentence means is that the analytical expression shown in the notes section does not include loc and scale. I guess this is worth a bug report / documentation improvement.
Original answer:
Indeed, the shape parameter topic is not well-documented when looking into the docs page for a particular distribution. I recommend having a look at the main stats documentation -- there is a section on shape parameters:
http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#shape-parameters
It looks like there should be a lognorm.shapes property, telling you about what the s parameter means, specifically.
Edit:
There is only one parameter, indeed:
>>> lognorm.shapes
's'
When comparing the general definition of the lognormal distribution (from Wikipedia):
f(x) = 1 / (x*σ*sqrt(2*pi)) * exp(-(ln(x) - μ)**2 / (2*σ**2))
and the formula given by the scipy docs:
lognorm.pdf(x, s) = 1 / (s*x*sqrt(2*pi)) * exp(-1/2*(log(x)/s)**2)
it becomes obvious that s is the true σ (sigma).
However, from the docs it is not obvious how the loc parameter is related to μ (mu).
It could be as in ln(x-loc), which would not correspond to μ in the general formula, or it could be ln(x)-loc, which would ensure correspondence between loc and μ. Try it out! :)
Edit 2
I have compared what WolframAlpha (WA) and SciPy say. WA is quite clear that it uses μ and σ as generally understood (as defined in the linked Wikipedia article).
>>> l = lognorm(s=2, loc=0)
>>> print("mean: %.5f stddev: %.5f" % (l.mean(), l.std()))
mean: 7.38906 stddev: 54.09584
This matches WA's output.
Now, for loc not being zero, there is a mismatch. Example:
>>> l = lognorm(s=2, loc=1)
>>> print("mean: %.5f stddev: %.5f" % (l.mean(), l.std()))
mean: 8.38906 stddev: 54.09584
WA gives a mean of 20.08 and a standard deviation of 147. There you have it, loc does not correspond to μ in the classical definition of the lognormal distribution.

Related

How can I generate numbers in a set range but skewed towards a specific point? [duplicate]

I would like to implement a function in python (using numpy) that takes a mathematical function (for ex. p(x) = e^(-x) like below) as input and generates random numbers, that are distributed according to that mathematical-function's probability distribution. And I need to plot them, so we can see the distribution.
I need actually exactly a random number generator function for exactly the following 2 mathematical functions as input, but if it could take other functions, why not:
1) p(x) = e^(-x)
2) g(x) = (1/sqrt(2*pi)) * e^(-(x^2)/2)
Does anyone have any idea how this is doable in python?
For simple distributions like the ones you need, or if the CDF is easy to invert in closed form, you can find plenty of samplers in NumPy, as correctly pointed out in Olivier's answer.
For arbitrary distributions you could use Markov chain Monte Carlo sampling methods.
The simplest, and maybe the easiest to understand, variant of these algorithms is Metropolis sampling.
The basic idea goes like this:
1. Start from a random point x and take a random step xnew = x + delta.
2. Evaluate the desired probability distribution at the starting point, p(x), and at the new one, p(xnew).
3. If the new point is more probable, p(xnew)/p(x) >= 1, accept the move.
4. If the new point is less probable, randomly decide whether to accept or reject it depending on how probable [1] the new point is, i.e. accept it with probability p(xnew)/p(x).
5. Take a new step from this point and repeat the cycle.
It can be shown, see e.g. Sokal [2], that points sampled with this method follow the desired probability distribution p.
An extensive implementation of Monte Carlo methods in Python can be found in the PyMC3 package.
Example implementation
Here's a toy example just to show you the basic idea, not meant in any way as a reference implementation. Please refer to mature packages for any serious work.
import numpy as np
def uniform_proposal(x, delta=2.0):
    return np.random.uniform(x - delta, x + delta)
def metropolis_sampler(p, nsamples, proposal=uniform_proposal):
    x = 1 # start somewhere
    for i in range(nsamples):
        trial = proposal(x) # random neighbour from the proposal distribution
        acceptance = p(trial)/p(x)
        # accept the move conditionally
        if np.random.uniform() < acceptance:
            x = trial
        yield x
Let's see if it works with some simple distributions
Gaussian mixture
def gaussian(x, mu, sigma):
    return 1./sigma/np.sqrt(2*np.pi)*np.exp(-((x-mu)**2)/2./sigma/sigma)
p = lambda x: gaussian(x, 1, 0.3) + gaussian(x, -1, 0.1) + gaussian(x, 3, 0.2)
samples = list(metropolis_sampler(p, 100000))
Cauchy
def cauchy(x, mu, gamma):
    return 1./(np.pi*gamma*(1.+((x-mu)/gamma)**2))
p = lambda x: cauchy(x, -2, 0.5)
samples = list(metropolis_sampler(p, 100000))
Arbitrary functions
You don't really have to sample from proper probability distributions. You might just have to enforce a limited domain in which to sample your random steps [3].
p = lambda x: np.sqrt(x)
samples = list(metropolis_sampler(p, 100000, domain=(0, 10)))
p = lambda x: (np.sin(x)/x)**2
samples = list(metropolis_sampler(p, 100000, domain=(-4*np.pi, 4*np.pi)))
Conclusions
There is still way too much to say, about proposal distributions, convergence, correlation, efficiency, applications, Bayesian formalism, other MCMC samplers, etc.
I don't think this is the proper place and there is plenty of much better material than what I could write here available online.
[1] The idea here is to favor exploration where the probability is higher but still look at low-probability regions, as they might lead to other peaks. The choice of the proposal distribution, i.e. how you pick new points to explore, is fundamental. Steps that are too small might confine you to a limited area of your distribution; steps that are too big could lead to a very inefficient exploration.
[2] Physics oriented. Bayesian formalism (Metropolis-Hastings) is preferred these days but IMHO it's a little harder to grasp for beginners. There are plenty of tutorials available online, see e.g. this one from Duke university.
[3] The implementation is not shown, so as not to add too much confusion, but it's straightforward: you just have to wrap trial steps at the domain edges or make the desired function go to zero outside the domain.
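The domain restriction mentioned in the last note could look roughly like the sketch below. It takes the second route from that note (treating the target as zero outside the domain, so such trials are always rejected). The function name bounded_metropolis_sampler, its signature and the rng argument are my own assumptions and not the original author's implementation.

```python
import numpy as np

def bounded_metropolis_sampler(p, nsamples, domain, delta=2.0, rng=None):
    """Toy Metropolis sampler restricted to domain = (low, high).

    Hypothetical helper: the target p is treated as zero outside the domain.
    """
    rng = np.random.default_rng() if rng is None else rng
    low, high = domain
    x = rng.uniform(low, high)                  # random start inside the domain
    for _ in range(nsamples):
        trial = rng.uniform(x - delta, x + delta)
        # Trials outside the domain are rejected without evaluating p there.
        # Comparing u * p(x) < p(trial) is the usual acceptance test
        # u < p(trial) / p(x), written without a division.
        if low <= trial <= high and rng.uniform() * p(x) < p(trial):
            x = trial
        yield x

samples = np.array(list(bounded_metropolis_sampler(np.sqrt, 20000, domain=(0, 10))))
print(samples.min(), samples.max(), samples.mean())
```

For p(x) = sqrt(x) on [0, 10] the exact mean of the normalised density is 6, which gives a quick sanity check on the sample mean.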
NumPy offers a wide range of probability distributions.
The first function is an exponential distribution with parameter 1.
np.random.exponential(1)
The second one is a normal distribution with mean 0 and variance 1.
np.random.normal(0, 1)
Note that in both cases, the arguments are optional as these are the default values for these distributions.
As a sidenote, you can also find those distributions in the random module as random.expovariate and random.gauss respectively.
More general distributions
While NumPy will likely cover all your needs, remember that you can always compute the inverse cumulative distribution function of your distribution and input values from a uniform distribution.
inverse_cdf(np.random.uniform())
For example, if NumPy did not provide the exponential distribution, you could do this:
def exponential():
    # inverse CDF of the unit exponential applied to a uniform draw in [0, 1)
    return -np.log(1 - np.random.uniform())
If you encounter distributions whose CDF is not easy to compute, then consider filippo's great answer.
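When the inverse CDF has no closed form but the density can be evaluated on a bounded interval, one more option is to tabulate the CDF numerically and invert it by interpolation. This is a different technique from the closed-form inversion above and only a rough sketch; the helper name sample_from_pdf, the grid size and the interval are assumptions made for illustration.

```python
import numpy as np

def sample_from_pdf(pdf, low, high, size, grid_points=10_001, rng=None):
    """Draw samples by numerically inverting a tabulated CDF (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.linspace(low, high, grid_points)
    density = pdf(x)
    # cumulative trapezoidal integral of the density, starting at 0
    steps = 0.5 * (density[1:] + density[:-1]) * np.diff(x)
    cdf = np.concatenate([[0.0], np.cumsum(steps)])
    cdf /= cdf[-1]                       # normalise, so pdf may be unnormalised
    u = rng.uniform(size=size)
    return np.interp(u, cdf, x)          # inverse CDF by linear interpolation

# the second function from the question: a standard normal density
g = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
samples = sample_from_pdf(g, -8.0, 8.0, size=100_000)
print(samples.mean(), samples.std())
```

The accuracy depends on the grid resolution and on how much probability mass lies outside [low, high], so the interval has to be chosen with the target density in mind.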

How to calculate one-sided tolerance interval with scipy

I would like to calculate a one sided tolerance bound based on the normal distribution given a data set with known N (sample size), standard deviation, and mean.
If the interval were two sided I would do the following:
conf_int = stats.norm.interval(alpha, loc=mean, scale=sigma)
In my situation, I am bootstrapping samples, but if I weren't I would refer to this post on stackoverflow: Correct way to obtain confidence interval with scipy, and use the following:
conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
How would you do the same thing, but to calculate this as a one sided bound (95% of values are above or below x<--bound)?
I assume that you are interested in computing one-side tolerance bound using the normal distribution (based on the fact you mention the scipy.stats.norm.interval function as the two-sided equivalent of your need).
Then the good news is that, based on the tolerance interval Wikipedia page:
One-sided normal tolerance intervals have an exact solution in terms of the sample mean and sample variance based on the noncentral t-distribution.
(FYI: Unfortunately, this is not the case for the two-sided setting)
This assertion is based on this paper. In addition, paragraph 4.8 (page 23) provides the formulas.
The bad news is that I do not think there is a ready-to-use scipy function that you can safely tweak and use for your purpose.
But you can easily calculate it yourself. GitHub has repositories containing such a calculator that you can draw inspiration from, for example the one on which I based the following illustrative example:
import numpy as np
from scipy.stats import norm, nct
# sample size
n=1000
# Percentile for the TI to estimate
p=0.9
# confidence level
g = 0.95
# a demo sample
x = np.array([np.random.normal(100) for k in range(n)])
# mean estimate based on the sample
mu_est = x.mean()
# standard deviation estimated based on the sample
sigma_est = x.std(ddof=1)
# (100*p)th percentile of the standard normal distribution
zp = norm.ppf(p)
# gth quantile of a non-central t distribution
# with n-1 degrees of freedom and non-centrality parameter np.sqrt(n)*zp
t = nct.ppf(g, df=n-1., nc=np.sqrt(n)*zp)
# k factor from Young et al paper
k = t / np.sqrt(n)
# One-sided tolerance upper bound
conf_upper_bound = mu_est + (k*sigma_est)
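The same computation can be wrapped in a function that returns either the upper or the lower bound; the lower bound uses the same k factor with the opposite sign. This is a sketch built on the formula above. The helper name one_sided_tolerance_bound and the synthetic sample are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm, nct

def one_sided_tolerance_bound(x, p=0.9, g=0.95, upper=True):
    """One-sided normal tolerance bound (hypothetical helper, same formula as above).

    p is the proportion of the population to cover, g the confidence level.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    zp = norm.ppf(p)
    # k factor from the noncentral t distribution
    k = nct.ppf(g, df=n - 1, nc=np.sqrt(n) * zp) / np.sqrt(n)
    mu_est, sigma_est = x.mean(), x.std(ddof=1)
    return mu_est + k * sigma_est if upper else mu_est - k * sigma_est

rng = np.random.default_rng(0)
sample = rng.normal(100, 1, size=1000)      # synthetic demo data

upper = one_sided_tolerance_bound(sample, upper=True)
lower = one_sided_tolerance_bound(sample, upper=False)
# plain normal quantile that ignores the estimation error on mean and sigma
naive_upper = norm.ppf(0.9, loc=sample.mean(), scale=sample.std(ddof=1))
print(lower, upper, naive_upper)
```

The tolerance bound lies slightly further out than the plain quantile because k exceeds the normal quantile zp for any finite sample when g is above 0.5.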
Here is a one-line solution with the openturns library, assuming your data is a numpy array named sample.
import openturns as ot
ot.NormalFactory().build(sample.reshape(-1, 1)).computeQuantile(0.95)
Let us unpack this. NormalFactory is a class designed to fit the parameters of a Normal distribution (mu and sigma) on a given sample: NormalFactory() creates an instance of this class.
The method build does the actual fitting and returns an object of the class Normal which represents the normal distribution with parameters mu and sigma estimated from the sample.
The sample reshape is there to make sure that OpenTURNS understands that the input sample is a collection of one-dimension points, not a single multi-dimensional point.
The class Normal then provides the method computeQuantile to compute any quantile of the distribution (the 95-th percentile in this example).
This solution does not compute the exact tolerance bound because it uses a quantile from a Normal distribution instead of a Student t-distribution. Effectively, that means that it ignores the estimation error on mu and sigma. In practice, this is only an issue for really small sample sizes.
To illustrate this, here is a comparison between the PDF of the standard normal N(0,1) distribution and the PDF of the Student t-distribution with 19 degrees of freedom (this means a sample size of 20). They can barely be distinguished.
deg_freedom = 19
graph = ot.Normal().drawPDF()
student = ot.Student(deg_freedom).drawPDF().getDrawable(0)
student.setColor('blue')
graph.add(student)
graph.setLegends(['Normal(0,1)', 't-dist k={}'.format(deg_freedom)])
graph

How to obtain a python scipy-type continuous rv distribution object that is bounded?

I would like to define a bounded version of a continuous random variable distribution (say, an exponential, but I might want to use others as well). The bounds are 0 and 1. I would like to
draw random variates (as done by scipy.stats.rv_continuous.rvs),
use the ppf (percentage point function) (as done by scipy.stats.rv_continuous.ppf), and possibly
use the cdf (cumulative distribution function) (as done by scipy.stats.rv_continuous.cdf)
Possible approaches I can think of:
Getting random variates in an ad hoc way is not difficult
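import numpy as np
import scipy.stats
target_number_of_rv = 1000 # example value; use whatever number of variates you need
d = scipy.stats.expon(0, 3/10.) # an exponential distribution as an example
rv = d.rvs(size=target_number_of_rv)
rv = rv[(0 <= rv) & (rv <= 1)] # keep only draws inside the bounds
while len(rv) < target_number_of_rv:
    rv = np.concatenate([rv, d.rvs(1)]) # append one more draw
    rv = rv[(0 <= rv) & (rv <= 1)]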
but 1) this is non-generic and potentially error-prone and 2) it does not help with the ppf or cdf.
Subclassing scipy.stats.rv_continuous, as is done here and here. Thereby, the ppf of scipy.stats.rv_continuous can be used. The drawback is that it requires the pdf (not just a pre-defined rv_continuous object or the pdf of the unbounded distribution and the bounds), and if this is wrong, cdf and ppf and everything else will be wrong as well.
Designing a class that cares for applying the bounds to the rv generation and for correcting the value of the ppf obtained from the unbounded object in scipy.stats. A drawback is that this is non-generic and error-prone as well and that it may be difficult to correct the ppf. My feeling is that the value of the cdf of the unbounded distribution could be scaled by what share of probability mass is out of the bounds (in total, lower and upper), but I may be wrong. That would be for lower and upper bounds l and u and any valid quantile x (with l<=x<=u): (cdf(x)-cdf(l))/(cdf(u)-cdf(l)). Obtaining the ppf would, however, require to invert the resulting function.
My feeling is that there might be a better and more generic way to do this. Is there? Maybe with sympy? Maybe by somehow obtaining the function object of the unbounded cdf and modifying it directly?
Python is version: 3.6.2, scipy is version 0.19.1.
If the distribution is one of those that is available in scipy.stats then you can evaluate its integral between the two bounds using the cdf for that distribution. Otherwise, you can define the pdf for rv_continuous and then use its cdf to get this integral.
Now, you have, in effect, the pdf for the bounded version of the pdf you want because you have calculated the normalising constant for it, in that integral. You can proceed to use rv_continuous with the form that you have for the pdf plus the normalising constant and with the bounds.
Here's what your code might be like. The variable scale is set according to the scipy documents. norm is the integral of the exponential pdf over [0,1]. Only about .49 of the probability mass is accounted for. Therefore, to make the exponential, when truncated to the [0,1] interval give a mass of one we must divide its pdf by this factor.
Truncated_expon is defined as a subclass of rv_continuous as in the documentation. By supplying its pdf we make it possible (at least for such a simple integral!) for scipy to calculate this distribution's cdf and thereby to calculate random samples.
I have calculated the cdf at one as a check.
>>> from scipy import stats
>>> lamda = 2/3
>>> scale = 1/lamda
>>> norm = stats.expon.cdf(1, scale=scale)
>>> norm
0.48658288096740798
>>> from math import exp
>>> class Truncated_expon(stats.rv_continuous):
...     def _pdf(self, x, lamda):
...         return lamda*exp(-lamda*x)/0.48658288096740798
...
>>> e = Truncated_expon(a=0, b=1, shapes='lamda')
>>> e.cdf(1, lamda=lamda)
1.0
>>> e.rvs(size=20, lamda=lamda)
array([ 0.20064067, 0.67646465, 0.89118679, 0.86093035, 0.14334989,
0.10505598, 0.53488779, 0.11606106, 0.41296616, 0.33650899,
0.95126415, 0.57481087, 0.04495104, 0.00308469, 0.23585195,
0.00653972, 0.59400395, 0.34919065, 0.91762547, 0.40098409])
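The scaled-CDF idea from the question, (cdf(x)-cdf(l))/(cdf(u)-cdf(l)), can in fact be inverted without solving anything numerically, because the unbounded ppf is already available on a frozen scipy.stats object. The following is a sketch of that approach and not a SciPy facility; the class name Truncated and its interface are hypothetical.

```python
import numpy as np
from scipy import stats

class Truncated:
    """Restrict a frozen scipy.stats distribution to [low, high] (hypothetical helper)."""

    def __init__(self, dist, low, high):
        self.dist, self.low, self.high = dist, low, high
        self.cdf_low = dist.cdf(low)
        self.mass = dist.cdf(high) - self.cdf_low   # probability inside the bounds

    def cdf(self, x):
        x = np.clip(x, self.low, self.high)
        return (self.dist.cdf(x) - self.cdf_low) / self.mass

    def ppf(self, q):
        # map q back to the probability scale of the unbounded distribution
        return self.dist.ppf(self.cdf_low + np.asarray(q) * self.mass)

    def rvs(self, size=None, random_state=None):
        # inverse transform sampling through the truncated ppf
        rng = np.random.default_rng(random_state)
        return self.ppf(rng.uniform(size=size))

t = Truncated(stats.expon(0, 3 / 10.), 0.0, 1.0)
print(t.cdf(1.0), t.ppf(0.5), t.rvs(size=5))
```

This works for any frozen continuous distribution that provides cdf and ppf, but it loses precision when the probability mass inside the bounds is extremely small.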

Integrating a function using non-uniform measure (python/scipy)

I would like to integrate a function in python and provide the probability density (measure) used to sample values. If it's not obvious, integrating f(x)dx in [a,b] implicitly use the uniform probability density over [a,b], and I would like to use my own probability density (e.g. exponential).
I can do it myself, using np.random.* but then
I miss the optimizations available in scipy.integrate.quad. Or maybe all those optimizations assume the uniform density?
I need to do the error estimation myself, which is not trivial. Or maybe it is? Maybe the error is just the variance of sum(f(x))/n?
Any ideas?
As unutbu said, if you have the density function, then you can just integrate the product of your function with the pdf using scipy.integrate.quad.
For the distributions that are available in scipy.stats, we can also just use the expect function.
For example
>>> import numpy as np
>>> from scipy import stats
>>> f = lambda x: x**2
>>> stats.norm.expect(f, loc=0, scale=1)
1.0000000000000011
>>> stats.norm.expect(f, loc=0, scale=np.sqrt(2))
1.9999999999999996
scipy.integrate.quad also has some predefined weight functions, although they are not normalized to be probability density functions.
The approximation error depends on the settings for the call to integrate.quad.
To summarize, four ways were suggested for calculating the expected value of f(x) under the probability density p(x):
1. Assuming p is given in closed form, use scipy.integrate.quad to integrate f(x)p(x).
2. Assuming p can be sampled from, draw N samples X from p, then estimate the expected value by np.mean(f(X)) and the error by np.std(f(X))/np.sqrt(N).
3. Assuming p is available in scipy.stats (e.g. stats.norm), use stats.norm.expect(f).
4. Assuming we have the CDF(x) of the distribution rather than p(x), calculate H=Inverse[CDF] and then integrate f(H(y)) over y in [0,1] using scipy.integrate.quad.
Another possibility would be to integrate y -> f(H(y)) over [0,1], where H is the inverse of the cumulative distribution function of your probability distribution.
[This follows from a change of variable: substituting y=CDF(x) and noting that p(x)=CDF'(x) gives dy=p(x)dx and thus int{f(x)p(x)dx}==int{f(x)dy}==int{f(H(y))dy}, with H the inverse of CDF.]
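The approaches discussed in this thread can be compared side by side on one example. The sketch below uses an exponential density with scale 2 and f(x) = x**2, both chosen only for illustration; the exact expected value in that case is 2 * scale**2 = 8.

```python
import numpy as np
from scipy import integrate, stats

f = lambda x: x ** 2
dist = stats.expon(scale=2.0)    # example density, an assumption for this sketch

# 1) integrate f(x) * p(x) over the support
quad_value, quad_error = integrate.quad(lambda x: f(x) * dist.pdf(x), 0, np.inf)

# 2) Monte Carlo estimate with its standard error
samples = dist.rvs(size=200_000, random_state=0)
mc_value = np.mean(f(samples))
mc_error = np.std(f(samples), ddof=1) / np.sqrt(samples.size)

# 3) built-in expectation of the frozen distribution
expect_value = dist.expect(f)

# 4) change of variable y = CDF(x): integrate f(H(y)) over [0, 1], with H = ppf
ppf_value, ppf_error = integrate.quad(lambda y: f(dist.ppf(y)), 0, 1)

print(quad_value, mc_value, mc_error, expect_value, ppf_value)
```

The quadrature-based variants return an error estimate from quad itself, while the Monte Carlo error shrinks only as 1/sqrt(N).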

Tracking down the assumptions made by SciPy's `ttest_ind()` function

I'm trying to write my own Python code to compute t-statistics and p-values for one and two tailed independent t tests. I can use the normal approximation, but for the moment I am trying to just use the t-distribution. I've been unsuccessful in matching the results of SciPy's stats library on my test data. I could use a fresh pair of eyes to see if I'm just making a dumb mistake somewhere.
Note, this is cross-posted from Cross-Validated because it's been up for a while over there with no responses, so I thought it can't hurt to also get some software developer opinions. I'm trying to understand if there's an error in the algorithm I'm using, which should reproduce SciPy's result. This is a simple algorithm, so it's puzzling why I can't locate the mistake.
My code:
import numpy as np
import scipy.stats as st
def compute_t_stat(pop1,pop2):
    num1 = pop1.shape[0]
    num2 = pop2.shape[0]
    # The formula for t-stat when population variances differ.
    t_stat = (np.mean(pop1) - np.mean(pop2))/np.sqrt( np.var(pop1)/num1 + np.var(pop2)/num2 )
    # ADDED: The Welch-Satterthwaite degrees of freedom.
    df = ((np.var(pop1)/num1 + np.var(pop2)/num2)**(2.0))/( (np.var(pop1)/num1)**(2.0)/(num1-1) + (np.var(pop2)/num2)**(2.0)/(num2-1) )
    # Am I computing this wrong?
    # It should just come from the CDF like this, right?
    # The extra parameter is the degrees of freedom.
    one_tailed_p_value = 1.0 - st.t.cdf(t_stat,df)
    two_tailed_p_value = 1.0 - ( st.t.cdf(np.abs(t_stat),df) - st.t.cdf(-np.abs(t_stat),df) )
    # Computing with SciPy's built-ins
    # My results don't match theirs.
    t_ind, p_ind = st.ttest_ind(pop1, pop2)
    return t_stat, one_tailed_p_value, two_tailed_p_value, t_ind, p_ind
Update:
After reading a bit more on the Welch's t-test, I saw that I should be using the Welch-Satterthwaite formula to calculate degrees of freedom. I updated the code above to reflect this.
With the new degrees of freedom, I get a closer result. My two-sided p-value is off by about 0.008 from the SciPy version's... but this is still much too big an error so I must still be doing something incorrect (or SciPy distribution functions are very bad, but it's hard to believe they are only accurate to 2 decimal places).
Second update:
While continuing to try things, I thought maybe SciPy's version automatically computes the Normal approximation to the t-distribution when the degrees of freedom are high enough (roughly > 30). So I re-ran my code using the Normal distribution instead, and the computed results are actually further away from SciPy's than when I use the t-distribution.
Bonus question :)
(More statistical theory related; feel free to ignore)
Also, the t-statistic is negative. I was just wondering what this means for the one-sided t-test. Does this typically mean that I should be looking in the negative axis direction for the test? In my test data, population 1 is a control group who did not receive a certain employment training program. Population 2 did receive it, and the measured data are wage differences before/after treatment.
So I have some reason to think that the mean for population 2 will be larger. But from a statistical theory point of view, it doesn't seem right to concoct a test this way. How could I have known to check (for the one-sided test) in the negative direction without relying on subjective knowledge about the data? Or is this just one of those frequentist things that, while not philosophically rigorous, needs to be done in practice?
By using the SciPy built-in function source(), I could see a printout of the source code for the function ttest_ind(). Based on the source code, the SciPy built-in is performing the t-test assuming that the variances of the two samples are equal. It is not using the Welch-Satterthwaite degrees of freedom. SciPy assumes equal variances but does not state this assumption.
I just want to point out that, crucially, this is why you should not just trust library functions. In my case, I actually do need the t-test for populations of unequal variances, and the degrees of freedom adjustment might matter for some of the smaller data sets I will run this on.
As I mentioned in some comments, the discrepancy between my code and SciPy's is about 0.008 for sample sizes between 30 and 400, and then slowly goes to zero for larger sample sizes. This is an effect of the extra (1/n1 + 1/n2) term in the equal-variances t-statistic denominator. Accuracy-wise, this is pretty important, especially for small sample sizes. It definitely confirms to me that I need to write my own function. (Possibly there are other, better Python libraries, but this at least should be known. Frankly, it's surprising this isn't anywhere up front and center in the SciPy documentation for ttest_ind()).
You are not calculating the sample variance, but instead you are using population variances. Sample variance divides by n-1, instead of n. np.var has an optional argument called ddof for reasons similar to this.
This should give you your expected result:
import numpy as np
import scipy.stats as st
def compute_t_stat(pop1,pop2):
    num1 = pop1.shape[0]
    num2 = pop2.shape[0]
    var1 = np.var(pop1, ddof=1)
    var2 = np.var(pop2, ddof=1)
    # The formula for t-stat when population variances differ.
    t_stat = (np.mean(pop1) - np.mean(pop2)) / np.sqrt(var1/num1 + var2/num2)
    # ADDED: The Welch-Satterthwaite degrees of freedom.
    df = ((var1/num1 + var2/num2)**(2.0))/((var1/num1)**(2.0)/(num1-1) + (var2/num2)**(2.0)/(num2-1))
    # Am I computing this wrong?
    # It should just come from the CDF like this, right?
    # The extra parameter is the degrees of freedom.
    one_tailed_p_value = 1.0 - st.t.cdf(t_stat,df)
    two_tailed_p_value = 1.0 - ( st.t.cdf(np.abs(t_stat),df) - st.t.cdf(-np.abs(t_stat),df) )
    # Computing with SciPy's built-ins
    # My results don't match theirs.
    t_ind, p_ind = st.ttest_ind(pop1, pop2)
    return t_stat, one_tailed_p_value, two_tailed_p_value, t_ind, p_ind
PS: SciPy is open source and mostly implemented in Python. You could have checked the source code for ttest_ind and found your mistake yourself.
For the bonus side: You don't decide on the side of the one-tail test by looking at your t-value. You decide it beforehand with your hypothesis. If your null hypothesis is that the means are equal and your alternative hypothesis is that the second mean is larger, then your tail should be on the left (negative) side. Because sufficiently small (negative) values of your t-value would indicate that the alternative hypothesis is more likely to be true instead of the null hypothesis.
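A manual Welch computation with sample variances (ddof=1) can be cross-checked against SciPy by passing equal_var=False to ttest_ind, in SciPy versions that provide that argument. The sketch below uses synthetic data with unequal variances, made up purely for illustration, and also shows the left-tailed p-value discussed above.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
pop1 = rng.normal(0.0, 1.0, size=40)    # synthetic data, for illustration only
pop2 = rng.normal(0.5, 3.0, size=60)

n1, n2 = pop1.size, pop2.size
# squared standard errors of the two sample means
v1 = np.var(pop1, ddof=1) / n1
v2 = np.var(pop2, ddof=1) / n2

t_stat = (pop1.mean() - pop2.mean()) / np.sqrt(v1 + v2)
# Welch-Satterthwaite degrees of freedom
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

two_tailed = 2.0 * st.t.sf(abs(t_stat), df)
left_tailed = st.t.cdf(t_stat, df)      # alternative: mean(pop1) < mean(pop2)

t_ref, p_ref = st.ttest_ind(pop1, pop2, equal_var=False)
print(np.isclose(t_stat, t_ref), np.isclose(two_tailed, p_ref), left_tailed)
```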
Looks like you forgot the **2 in the numerator of your df, the Welch-Satterthwaite degrees of freedom.
df = (np.var(pop1)/num1 + np.var(pop2)/num2)/( (np.var(pop1)/num1)**(2.0)/(num1-1) + (np.var(pop2)/num2)**(2.0)/(num2-1) )
should be:
df = (np.var(pop1)/num1 + np.var(pop2)/num2)**2/( (np.var(pop1)/num1)**(2.0)/(num1-1) + (np.var(pop2)/num2)**(2.0)/(num2-1) )
