Difference of two random distributions in Python

Good day!
I have two gamma distributions and want to find the distribution of their difference.
I use np.random.gamma to generate samples with those parameters, but the fitted result is very different from run to run.
Code:
import numpy as np
from scipy.stats import gamma

for i in range(0, 10):
    s1 = np.random.gamma(1.242619972, 0.062172619, 2000) + 0.479719122
    s2 = np.random.gamma(456.1387112, 0.002811328, 2000) - 0.586076723
    r_a, r_loc, r_scale = gamma.fit(s1 - s2)
    print(1 - gamma.cdf(0.0, r_a, r_loc, r_scale))
Result:
0.4795655021157602
0.07061938039031612
0.06960741675590854
0.4957568913729331
0.4889900326940878
0.07381963810128422
0.0690800784280835
0.07198551429809896
0.07659274505827551
0.06967441935502583
I receive two quite different values for the CDF at 0: roughly 0.48 and roughly 0.07. What could be the problem?

You're fitting a gamma distribution to the difference between two other gamma distributions. A gamma-distributed variable can only be positive, while the difference can be negative, so that fit makes no sense and you can't expect a consistent answer. If you print the mean of the difference instead, you get consistent results.
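For instance, a minimal check along those lines (same parameters as in the question), printing the sample mean of the difference instead of the fitted tail probability:
import numpy as np

for i in range(10):
    s1 = np.random.gamma(1.242619972, 0.062172619, 2000) + 0.479719122
    s2 = np.random.gamma(456.1387112, 0.002811328, 2000) - 0.586076723
    # the sample mean of the difference is stable from run to run,
    # unlike the parameters of an (inappropriate) gamma fit
    print(np.mean(s1 - s2))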

Related

How to sample from a custom distribution when parameters are known?

The target is to get samples from a distribution whose parameters are known.
For example, the self-defined distribution is p(X|theta), where theta the parameter vector of K dimensions and X is the random vector of N dimensions.
Now we know that (1) theta is known; (2) p(X|theta) is NOT known, but I do know that p(X|theta) ∝ f(X, theta), where f is a known function.
Can pymc3 do such sampling from p(X|theta), and how?
The purpose is not to sample from the posterior distribution of the parameters, but to sample from a self-defined distribution.
Starting from a simple example of sampling from a Bernoulli distribution, I did the following:
import pymc3 as pm
import numpy as np
import scipy.stats as stats
import pandas as pd
import theano.tensor as tt

with pm.Model() as model1:
    p = 0.3
    # tt.switch(x1, tt.log(p), tt.log(1 - p)) is the Bernoulli log-likelihood
    # taken from the pymc3 source code
    density = pm.DensityDist(
        'density',
        lambda x1: tt.switch(x1, tt.log(p), tt.log(1 - p)),
    )

with model1:
    step = pm.Metropolis()
    samples = pm.sample(1000, step=step)
I expect the result to be 1000 binary values, with the proportion of 1s around 0.3. However, I got strange results where very large numbers occur in the output.
I know something is wrong. Please help me understand how to correctly write pymc3 code for such non-posterior MCMC sampling problems.
Prior predictive sampling (for which you should be using pm.sample_prior_predictive()) involves only using the RNGs provided by the RandomVariable objects in your compute graph. By default, DensityDist does not implement an RNG, but it does provide the random parameter for this purpose, so you'll need to use that. The log-likelihood is only evaluated with respect to observables, so it plays no role here.
A simple way to generate a valid RNG for an arbitrary distribution is to use inverse transform sampling. In this case, one samples a uniform distribution on the unit interval and then transforms it through the inverse CDF of the desired function. For the Bernoulli case, the inverse CDF partitions the unit line based on the probability of success, assigning 0 to one part and 1 to the other.
Here is a factory-like implementation that creates a Bernoulli RNG compatible with pm.DensityDist's random parameter (i.e., accepts point and size kwargs).
def get_bernoulli_rng(p=0.5):
    def _rng(point=None, size=1):
        # Bernoulli inverse CDF, given p (prob of success)
        _icdf = lambda q: np.uint8(q < p)
        return _icdf(pm.Uniform.dist().random(point=point, size=size))
    return _rng
So, to fill out the example, it would go something like
with pm.Model() as m:
    p = 0.3
    y = pm.DensityDist('y', lambda x: tt.switch(x, tt.log(p), tt.log(1 - p)),
                       random=get_bernoulli_rng(p))

    prior = pm.sample_prior_predictive(random_seed=2019)

prior['y'].mean()  # 0.306
Obviously, this could equally be done with random=pm.Bernoulli.dist(p).random, but the above illustrates generically how one could do this with arbitrary distributions, given their inverse CDF, i.e., you only need to modify _icdf and the parameters.
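As a hypothetical illustration of that last point (not part of the original answer), here is the same factory pattern with a different inverse CDF, giving an exponential RNG; the exponential inverse CDF is -ln(1 - q) / lam:
import numpy as np

def get_exponential_rng(lam=1.0):
    def _rng(point=None, size=1):
        # exponential inverse CDF: F^{-1}(q) = -ln(1 - q) / lam
        _icdf = lambda q: -np.log(1.0 - q) / lam
        # plain numpy uniforms for brevity; inside a model you could use
        # pm.Uniform.dist().random(point=point, size=size) as above
        return _icdf(np.random.uniform(size=size))
    return _rng

samples = get_exponential_rng(lam=2.0)(size=10000)
print(samples.mean())  # should be close to 1/lam = 0.5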

Multivariate normal CDF in Python

I am looking for a function to compute the CDF of a multivariate normal distribution. I have found that scipy.stats.multivariate_normal has only a method to compute the PDF (for a sample x), multivariate_normal.pdf(x, mean=mean, cov=cov), but not the CDF.
I am looking for the same thing but to compute the CDF, something like multivariate_normal.cdf(x, mean=mean, cov=cov), but unfortunately multivariate_normal doesn't have a cdf method.
The only thing that I found is this: Multivariate Normal CDF in Python using scipy
but the presented method scipy.stats.mvn.mvnun(lower, upper, means, covar) doesn't take a sample x as a parameter, so I don't really see how to use it to have something similar to what I said above.
This is just a clarification of the points that @sascha made above in the comments. The relevant function is mvnormcdf from statsmodels (used below).
As an example, in a multivariate normal distribution with diagonal covariance, the CDF evaluated at the mean should give (1/4) * total area = 0.25 (picture the scatterplot of the two components if you don't see why). The following example will let you play with it:
import numpy as np
from statsmodels.sandbox.distributions.extras import mvnormcdf

upper = np.array((0, 0))  # evaluate the CDF at the mean
for i in range(1, 20, 2):
    cov_example = np.array(((i, 0), (0, i)))
    mean_example = np.array((0, 0))
    print(mvnormcdf(upper=upper, mu=mean_example, cov=cov_example))
The output of this is 0.25, 0.25, 0.25, 0.25...
The CDF of some distribution is actually an integral over the PDF of that distribution. That being so, you need to provide the function with the boundaries of the integral.
What most people mean when they ask for a p_value of some point in relation to some distribution is:
what is the chance of getting these values or higher given this distribution?
Note that what we want is not a point value, but rather an integral of the PDF from some point onwards:
Accordingly, you need to set your point as the lower boundary, +inf (or some arbitrarily high enough value) as the upper boundary and provide the means and covariance matrix you already have:
import numpy as np
from sys import maxsize
from scipy.stats import mvn

def mvn_p_value(x, mu, cov_matrix):
    # upper bound: effectively +inf in every dimension, same length as x
    upper_bounds = np.array([maxsize] * x.size)
    # mvnun returns (value, inform); the integral itself is the first element
    p_value = mvn.mvnun(x, upper_bounds, mu, cov_matrix)[0]
    if p_value > 0.5:  # this inversion is used for two-sided statistical testing
        p_value = 1 - p_value
    return p_value
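A quick usage sketch with made-up numbers (assuming the imports and function above are in scope):
x = np.array([1.0, 0.5])
mu = np.array([0.0, 0.0])
cov_matrix = np.array([[1.0, 0.2],
                       [0.2, 2.0]])

print(mvn_p_value(x, mu, cov_matrix))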

ks and chisquare test reject equality of distributions for data coming from same DGP

I generated two distributions using the following code:
import numpy as np
from scipy import stats

rand_num1 = 2*np.random.randn(10000) + 1
rand_num2 = 2*np.random.randn(10000) + 1
stats.ks_2samp(rand_num1, rand_num2)
My question is: why do these two samples not test as being the same according to the KS test and the chi-square test?
When I run a kstest on the 2 distributions I get:
Ks_2sampResult(statistic=0.019899999999999973, pvalue=0.037606196570126725)
which implies that the two distributions are statistically different. I use the following code to plot the CDF of the two distributions:
import matplotlib.pyplot as plt

count1, bins = np.histogram(rand_num1, bins=100)
count2, _ = np.histogram(rand_num2, bins=bins)
plt.plot(np.cumsum(count1), 'g-')
plt.plot(np.cumsum(count2), 'b.')
This is how the CDFs of the two distributions look.
When I run a chisquare test I get the following:
stats.chisquare(count1, count2) # Gives an nan output
stats.chisquare(count1+1, count2+1) # Outputs "Power_divergenceResult(statistic=180.59294741316694, pvalue=1.0484033143507713e-06)"
I have 3 questions below:
Even though the CDFs look the same and the data come from the same distribution, why do the KS test and the chi-square test both reject the hypothesis that the distributions are the same? Is there an underlying assumption that I am missing here?
Some counts are 0 and hence the first chisquare() call gives an error. Is it accepted practice to just add a non-zero number to all counts to get a correct estimate?
Is there a KS test to test against non-standard distributions, say a normal with a non-zero mean and a standard deviation other than 1?
The CDF, in my humble opinion, is not a good curve to look at. Because it is an integral, it hides a lot of detail: an outlier far below the bulk of the distribution can be compensated by another outlier far above it.
OK, let's take a look at the distribution of K-S results. I've run the test 100 times and plotted the statistic vs. the p-value and, as expected, some cases produce (small p, large statistic) points.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

np.random.seed(12345)

x = []
y = []
for k in range(0, 100):
    rand_num1 = 2.0*np.random.randn(10000) + 1.0
    rand_num2 = 2.0*np.random.randn(10000) + 1.0
    q = stats.ks_2samp(rand_num1, rand_num2)
    x.append(q.statistic)
    y.append(q.pvalue)

plt.scatter(x, y, alpha=0.1)
plt.show()
Graph
UPDATE
In reality, if I run a test and see the test vs. control distribution of my metric as shown in my plot, I would want to be able to say that they are the same - are there any statistics or parameters around these tests that can tell me how close these distributions are?
Of course there are - and you're using one of them! K-S is the most general but also the weakest test. And, as with any test you might use, there will ALWAYS be cases where the test says the samples come from different distributions even though you deliberately sampled them from the same routine. It is just the nature of things: you get a yes or no with some confidence, but not much more. Look at the graph above again for an illustration.
Concerning your experiments with chi2, I was skeptical from the beginning about using chi2 for such a task. To me, given the problem of deciding whether two samples come from the same distribution, the test used should be explicitly symmetric. K-S is OK, but looking at the definition of chi2, it is NOT symmetric. A simple modification of your code
count1, bins = np.histogram(rand_num1, bins = 40, range=(-2.,2.))
count2, _ = np.histogram(rand_num2, bins = bins, range=(-2.,2.))
q = stats.chisquare(count2, count1)
print(q)
q = stats.chisquare(count1, count2)
print(q)
produces something like
Power_divergenceResult(statistic=87.645335824746468, pvalue=1.3298580128472864e-05)
Power_divergenceResult(statistic=77.582358201839526, pvalue=0.00023275129585256563)
Basically, it means the test may pass if you run it as (1, 2) but fail if you run it as (2, 1), which is not good, IMHO. Chi2 is fine with me as long as you test observed counts against expected values from a known distribution curve - there the asymmetry of the test makes sense.
I would advise trying the Anderson-Darling test along the lines of
q = stats.anderson_ksamp([np.sort(rand_num1), np.sort(rand_num2)])
print(q)
But remember, it is the same as with K-S: some samples may fail the test even if they are drawn from the same underlying distribution - this is just the nature of the beast.
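As a side note on the third question in the post (testing against a normal with a non-zero mean and a non-unit standard deviation): the one-sample scipy.stats.kstest accepts distribution parameters, so a sketch along these lines should work (values taken from the code above):
import numpy as np
from scipy import stats

rand_num1 = 2.0*np.random.randn(10000) + 1.0

# one-sample K-S test against N(mean=1, std=2); args are (loc, scale)
print(stats.kstest(rand_num1, 'norm', args=(1.0, 2.0)))

# equivalently, pass the CDF of a frozen distribution
print(stats.kstest(rand_num1, stats.norm(loc=1.0, scale=2.0).cdf))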
UPDATE: Some reading material
https://stats.stackexchange.com/questions/187016/scipy-chisquare-applied-on-continuous-data

P-value from Chi sq test statistic in Python

I have computed a test statistic that is distributed as a chi square with 1 degree of freedom, and want to find out what P-value this corresponds to using python.
I'm a Python and maths/stats newbie, so I think what I want here is the probability density function for the chi2 distribution from SciPy. However, when I use it like so:
>>> from scipy import stats
>>> stats.chi2.pdf(3.84, 1)
0.029846
However, some googling and talking to colleagues who know maths but not Python suggested it should be 0.05.
Any ideas?
Cheers,
Davy
Quick refresher here:
Probability Density Function: think of it as a point value; how dense is the probability at a given point?
Cumulative Distribution Function: this is the mass of probability of the function up to a given point; what percentage of the distribution lies on one side of this point?
In your case, you took the PDF; the value you got is the correct density, but it isn't the p-value you're after. If you try 1 - CDF:
>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147
To calculate the probability under the null hypothesis given a chi-squared sum and degrees of freedom, you can also call chisqprob:
>>> from scipy.stats import chisqprob
>>> chisqprob(3.84, 1)
0.050043521248705189
Notice:
chisqprob is deprecated! stats.chisqprob is deprecated in scipy 0.17.0; use stats.distributions.chi2.sf instead
Update: as noted, chisqprob() is deprecated from scipy 0.17.0 onwards. High-accuracy p-values for chi-square statistics can now be obtained via scipy.stats.distributions.chi2.sf(), for example:
>>> from scipy.stats.distributions import chi2
>>> chi2.sf(3.84, 1)
0.050043521248705189
>>> chi2.sf(1424, 1)
1.2799986253099803e-311
While stats.chisqprob() and 1 - stats.chi2.cdf() appear comparable for small chi-square values, for large chi-square values the former is preferable. The latter cannot provide a p-value smaller than machine epsilon, and will give very inaccurate answers close to machine epsilon. As shown by others, the two methods give comparable values for small chi-square statistics:
>>> from scipy.stats import chisqprob, chi2
>>> chisqprob(3.84, 1)
0.050043521248705189
>>> 1 - chi2.cdf(3.84, 1)
0.050043521248705147
Using 1-chi2.cdf() breaks down here:
>>> 1 - chi2.cdf(67, 1)
2.2204460492503131e-16
>>> 1 - chi2.cdf(68, 1)
1.1102230246251565e-16
>>> 1 - chi2.cdf(69, 1)
1.1102230246251565e-16
>>> 1 - chi2.cdf(70, 1)
0.0
Whereas chisqprob() gives you accurate results for a much larger range of chi-square values, producing p-values nearly as small as the smallest float greater than zero, until it too underflows:
>>> chisqprob(67, 1)
2.7150713219425247e-16
>>> chisqprob(68, 1)
1.6349553217245471e-16
>>> chisqprob(69, 1)
9.8463440314253303e-17
>>> chisqprob(70, 1)
5.9304458500824782e-17
>>> chisqprob(500, 1)
9.505397766554137e-111
>>> chisqprob(1000, 1)
1.7958327848007363e-219
>>> chisqprob(1424, 1)
1.2799986253099803e-311
>>> chisqprob(1425, 1)
0.0
You meant to do:
>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147
Some of the other solutions rely on deprecated functions. Use the scipy.stats.chi2 survival function, which is the same as 1 - cdf(chi_statistic, df):
Example:
from scipy.stats import chi2
p_value = chi2.sf(chi_statistic, df)
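For instance, plugging in the statistic from the question reproduces the 0.05 result quoted above:
from scipy.stats import chi2

p_value = chi2.sf(3.84, 1)  # survival function = 1 - cdf
print(p_value)              # ~0.050043521248705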
If you want to understand the math, the p-value of a sample, x (fixed), is
P[P(X) <= P(x)] = P[m(X) >= m(x)] = 1 - G(m(x)^2)
where,
P is the probability density of a (say k-variate) normal distribution with known covariance (cov) and mean,
X is a random variable from that normal distribution,
m(x) is the Mahalanobis distance = sqrt(<cov^{-1}(x - mean), x - mean>). Note that in 1-d this is just the absolute value of the z-score.
G is the CDF of the chi^2 distribution w/ k degrees of freedom.
So if you're computing the p-value of a fixed observation, x, then you compute m(x) (the generalized z-score) and 1 - G(m(x)^2).
For example, it's well known that if x is sampled from a univariate (k = 1) normal distribution and has z-score = 2 (it's 2 standard deviations from the mean), then the p-value is about 0.046 (see a z-score table):
In [7]: from scipy.stats import chi2
In [8]: k = 1
In [9]: z = 2
In [10]: 1-chi2.cdf(z**2, k)
Out[10]: 0.045500263896358528
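For k > 1 the same recipe applies; here is a small sketch with a made-up 2-d mean, covariance and observation (the numbers are illustrative only):
import numpy as np
from scipy.stats import chi2

mean = np.array([0.0, 0.0])
cov = np.array([[2.0, 0.3],
                [0.3, 1.0]])
x = np.array([1.5, -2.0])

d = x - mean
m2 = d @ np.linalg.solve(cov, d)  # squared Mahalanobis distance <cov^{-1}(x - mean), x - mean>
p_value = chi2.sf(m2, df=len(x))  # 1 - G(m(x)^2) with k = 2 degrees of freedom
print(m2, p_value)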
For ultra-high precision, when scipy's chi2.sf() isn't enough, bring out the big guns:
>>> import numpy as np
>>> from rpy2.robjects import r
>>> np.exp(np.longdouble(r.pchisq(19000, 2, lower_tail=False, log_p=True)[0]))
1.5937563168532229629e-4126
Update by another user (WestCoastProjects): when using the values from the OP we get:
np.exp(np.longdouble(r.pchisq(3.84,1, lower_tail=False, log_p=True)[0]))
Out[5]: 0.050043521248705198928
So there's that 0.05 you were looking for.

How do I get a lognormal distribution in Python with Mu and Sigma?

I have been trying to get the result of a lognormal distribution using SciPy. I already have the Mu and Sigma, so I don't need to do any other prep work. If I need to be more specific (and I am trying to be with my limited knowledge of stats), I would say that I am looking for the cumulative function (cdf under SciPy). The problem is that I can't figure out how to do this with just the mean and standard deviation on a scale of 0-1 (i.e., the answer returned should be between 0 and 1). I'm also not sure which method of dist I should be using to get the answer. I've tried reading the documentation and looking through SO, but the relevant questions (like this and this) didn't seem to provide the answers I was looking for.
Here is a code sample of what I am working with. Thanks.
from scipy.stats import lognorm
stddev = 0.859455801705594
mean = 0.418749176686875
total = 37
dist = lognorm.cdf(total,mean,stddev)
UPDATE:
So after a bit of work and a little research, I got a little further, but I am still getting the wrong answer. The new code is below. According to R and Excel, the result should be 0.7434, but that's clearly not what is happening. Is there a logic flaw I am missing?
dist = lognorm([1.744],loc=2.0785)
dist.cdf(25) # yields=0.96374596, expected=0.7434
UPDATE 2:
Working lognorm implementation which yields the correct 0.7434 result.
import math

def lognorm(x, mu=0, sigma=1):
    a = (math.log(x) - mu) / math.sqrt(2*sigma**2)
    p = 0.5 + 0.5*math.erf(a)
    return p

lognorm(25, mu=2.0785, sigma=1.744)
> 0.7434
I know this is a bit late (almost one year!) but I've been doing some research on the lognorm function in scipy.stats. A lot of folks seem confused about the input parameters, so I hope to help these people out. The example above is almost correct, but I found it strange to set the mean to the location ("loc") parameter - this signals that the cdf or pdf doesn't 'take off' until the value is greater than the mean. Also, the scale and shape arguments should be exp(mu) and sigma, where mu and sigma are the mean and standard deviation of the variate's logarithm.
Simply put, the arguments are (x, shape, loc, scale), with the parameter definitions below:
loc - No equivalent, this gets subtracted from your data so that 0 becomes the infimum of the range of the data.
scale - exp μ, where μ is the mean of the log of the variate. (When fitting, typically you'd use the sample mean of the log of the data.)
shape - the standard deviation of the log of the variate.
I went through the same frustration as most people with this function, so I'm sharing my solution. Just be careful because the explanations aren't very clear without a compendium of resources.
For more information, I found these sources helpful:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.lognorm.html#scipy.stats.lognorm
https://stats.stackexchange.com/questions/33036/fitting-log-normal-distribution-in-r-vs-scipy
And here is an example, taken from @serv-inc's answer posted elsewhere on this page:
import math
from scipy import stats
# standard deviation of normal distribution
sigma = 0.859455801705594
# mean of normal distribution
mu = 0.418749176686875
# hopefully, total is the value where you need the cdf
total = 37
frozen_lognorm = stats.lognorm(s=sigma, scale=math.exp(mu))
frozen_lognorm.cdf(total) # use whatever function and value you need here
It sounds like you want to instantiate a "frozen" distribution from known parameters. In your example, you could do something like:
from scipy.stats import lognorm
stddev = 0.859455801705594
mean = 0.418749176686875
dist=lognorm([stddev],loc=mean)
which will give you a lognorm distribution object with the mean and standard deviation you specify. You can then get the pdf or cdf like this:
import numpy as np
import pylab as pl
x=np.linspace(0,6,200)
pl.plot(x,dist.pdf(x))
pl.plot(x,dist.cdf(x))
Is this what you had in mind?
from math import exp
from scipy import stats

def lognorm_cdf(x, mu, sigma):
    shape = sigma
    loc = 0
    scale = exp(mu)
    return stats.lognorm.cdf(x, shape, loc, scale)

x = 25
mu = 2.0785
sigma = 1.744
p = lognorm_cdf(x, mu, sigma)  # yields the expected 0.74341
Similar to Excel and R, the lognorm_cdf function above parameterizes the CDF of the log-normal distribution using mu and sigma.
Although SciPy uses shape, loc and scale parameters to characterize its probability distributions, for the log-normal distribution I find it slightly easier to think of these parameters at the variable level rather than at the distribution level. Here's what I mean...
A log-normal variable X is related to a normal variable Z as follows:
X = exp(mu + sigma * Z) #Equation 1
which is the same as:
X = exp(mu) * exp(Z)**sigma #Equation 2
This can be sneakily re-written as follows:
X = exp(mu) * exp(Z-Z0)**sigma #Equation 3
where Z0 = 0. This equation is of the form:
f(x) = a * ( (x-x0) ** b ) #Equation 4
If you can visualize equations in your head, it should be clear that the scale, shape and location parameters in Equation 4 are a, b and x0, respectively. This means that in Equation 3 the scale, shape and location parameters are exp(mu), sigma and zero, respectively.
If you can't visualize that very clearly, let's rewrite Equation 2 as a function:
f(Z) = exp(mu) * exp(Z)**sigma #(same as Equation 2)
and then look at the effects of mu and sigma on f(Z). The figure below holds sigma constant and varies mu. You should see that mu vertically scales f(Z). However, it does so in a nonlinear manner; the effect of changing mu from 0 to 1 is smaller than the effect of changing mu from 1 to 2. From Equation 2 we see that exp(mu) is actually the linear scaling factor. Hence SciPy's "scale" is exp(mu).
The next figure holds mu constant and varies sigma. You should see that the shape of f(Z) changes. That is, f(Z) has a constant value when Z=0 and sigma affects how quickly f(Z) curves away from the horizontal axis. Hence SciPy's "shape" is sigma.
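A quick numerical check of that reading (using the mu and sigma values from earlier in this thread): evaluating the SciPy CDF with shape=sigma and scale=exp(mu) should match the CDF of a standard normal applied to (ln(x) - mu) / sigma.
import numpy as np
from scipy.stats import lognorm, norm

mu, sigma = 2.0785, 1.744
x = 25.0

via_lognorm = lognorm.cdf(x, s=sigma, scale=np.exp(mu))
via_norm = norm.cdf((np.log(x) - mu) / sigma)
print(via_lognorm, via_norm)  # both ~0.7434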
Even more late, but in case it's helpful to anyone else: I found that Excel's
LOGNORM.DIST(x,Ln(mean),standard_dev,TRUE)
provides the same results as python's
from scipy.stats import lognorm
lognorm.cdf(x,sigma,0,mean)
Likewise, Excel's
LOGNORM.DIST(x,Ln(mean),standard_dev,FALSE)
seems equivalent to Python's
from scipy.stats import lognorm
lognorm.pdf(x,sigma,0,mean).
@lucas' answer has the usage down pat. As a code example, you could use
import math
from scipy import stats
# standard deviation of normal distribution
sigma = 0.859455801705594
# mean of normal distribution
mu = 0.418749176686875
# hopefully, total is the value where you need the cdf
total = 37
frozen_lognorm = stats.lognorm(s=sigma, scale=math.exp(mu))
frozen_lognorm.cdf(total) # use whatever function and value you need here
Known mean and stddev of the lognormal distribution
In case someone is looking for it, here is a solution for getting the scipy.stats.lognorm distribution if the mean mu and standard deviation sigma of the lognormal distribution are known. In this case we have to calculate the stats.lognorm parameters from the known mu and sigma like so:
import numpy as np
from scipy import stats
mu = 10
sigma = 3
a = 1 + (sigma / mu) ** 2
s = np.sqrt(np.log(a))
scale = mu / np.sqrt(a)
This was obtained by looking into the implementation of the variance and mean calculations in the stats.lognorm.stats method and essentially reversing it (solving for the input).
Then we can initialize the frozen distribution instance
distr = stats.lognorm(s, 0, scale)
# generate some randomvals
randomvals = distr.rvs(1_000_000)
# calculate mean and variance using the dedicated method
mu_stats, var_stats = distr.stats("mv")
Compare means and stddevs from input, randomvals and analytical solution from distr.stats:
print(f"""
Mean Std
----------------------------
Input: {mu:6.2f} {sigma:6.2f}
Randomvals: {randomvals.mean():6.2f} {randomvals.std():6.2f}
lognorm.stats: {mu_stats:6.2f} {np.sqrt(var_stats):6.2f}
""")
Mean Std
----------------------------
Input: 10.00 3.00
Randomvals: 10.00 3.00
lognorm.stats: 10.00 3.00
Plot PDF from stats.lognorm and histogram of the random values:
import holoviews as hv
hv.extension('bokeh')
x = np.linspace(0, 30, 301)
counts, _ = np.histogram(randomvals, bins=x)
counts = counts / counts.sum() / (x[1] - x[0])
(hv.Histogram((counts, x))
* hv.Curve((x, distr.pdf(x))).opts(color="r").opts(width=900))
If you read this and just want a function with behaviour similar to lnorm in R, then relieve yourself from violent anger and use numpy's numpy.random.lognormal.
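A minimal sketch (the mu and sigma here are just the log-space values used earlier in the thread):
import numpy as np

# mean and sigma refer to the underlying normal (log-space),
# analogous to R's rlnorm(n, meanlog, sdlog)
mu, sigma = 2.0785, 1.744
samples = np.random.lognormal(mean=mu, sigma=sigma, size=100000)

# the fraction below 25 should approximate the ~0.7434 CDF value discussed above
print((samples < 25).mean())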
