Statsmodels fit distribution between 0 and 1 - Python

I am trying to fit a beta distribution that should be defined between 0 and 1 on a data set that only has samples in a subrange. My problem is that using the fit() function will cause the fitted PDF to be defined only between my smallest and largest values.
For instance, if my dataset has samples between 0.2 and 0.3, what I get is a PDF defined between 0.2 and 0.3, instead of between 0 and 1, as it should be. The code I am using is:
ps1 = beta.fit(selected, loc=0, scale=1)
Am I missing something?

So:
you know that the distribution has lower and upper bounds a=0 and b=1,
but the sample does not contain any values close to these limits.
This may happen if the distribution truly is a Beta distribution and the alpha and beta parameters are such that the density near 0 and 1 is zero.
In this case, I would suggest using the maximum likelihood method, restricting the active parameters to alpha and beta, with a and b treated as known.
This is easy with the MaximumLikelihoodFactory class of OpenTURNS, which has a setKnownParameter method. This method lets you restrict which parameters are optimized by maximum likelihood.
To reproduce this situation, I created a Beta distribution with the following parameters.
import openturns as ot
distribution = ot.Beta(3.0, 2.0, 0.0, 1.0)
sampleSize = 100
sample = distribution.getSample(sampleSize)
Fitting a Beta distribution with known a and b parameters is straightforward.
factory = ot.MaximumLikelihoodFactory(distribution)
factory.setKnownParameter([0.0, 1.0], [2, 3])
inf_distribution = factory.build(sample)
The list [0.0, 1.0] contains the values of the a and b parameters and the indices [2, 3] are the indices of the parameters in the Beta distribution.
This produces:
Beta(alpha = 3.02572, beta = 1.88172, a = 0, b = 1)
with the sample I simulated.

I came up with a partial solution that does the trick for me: I replicate my samples (for the datasets that are too small) and add dummy samples at 0 and 1. Although that increases the fit error, it is low enough for my purpose.
Also, I asked on Google Groups and got this answer, which works fine but occasionally gives me some errors. I hope this helps anyone with the same problem.
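For reference, here is a minimal scipy-only sketch (my addition, not the Google Groups answer): scipy.stats.beta.fit accepts floc and fscale keyword arguments that fix the location and scale during fitting, so the fitted density stays defined on [0, 1] even when the data only cover a subrange. The selected array below is a hypothetical stand-in for the real dataset.
import numpy as np
from scipy.stats import beta

selected = np.random.uniform(0.2, 0.3, size=50)  # stand-in for the real data
a_hat, b_hat, loc_hat, scale_hat = beta.fit(selected, floc=0, fscale=1)
print(a_hat, b_hat, loc_hat, scale_hat)  # loc and scale stay fixed at 0 and 1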

Related

scipy.stats cdf greater than 1

I'm using scipy.stats and I need the CDF up to a given value x for some distributions. I know PDFs can be greater than 1 because they are densities, not probabilities, so they should integrate to 1 even if specific values exceed 1; CDFs, however, should never be greater than 1. Yet when running the cdf function in scipy.stats I sometimes get values like 2.89, and I'm completely sure I'm using cdf and not pdf (that was my first guess). This is messing up my results and algorithm because I need accumulated probabilities. Why is scipy.stats cdf returning values greater than 1, and how should I fix it?
Code for reproducing the issue with a sample distribution and parameters (it happens with others too):
from scipy import stats
distribution = stats.gausshyper
params = [9.482986347673158, 16.65813644507513, -38.11083665959626, 16.08698932118982, -13.387170754433273, 18.352117022674125]
test_val = [-0.512720,1,1]
arg = params[:-2]
loc = params[-2]
scale = params[-1]
print("cdf:",distribution.cdf(test_val,*arg, loc=loc,scale=scale))
print("pdf:",distribution.pdf(test_val,*arg, loc=loc,scale=scale))
cdf: [2.68047481 7.2027761 7.2027761 ]
pdf: [2.76857133 2.23996739 2.23996739]
The problem lies in the parameters that you have specified for the Gaussian hypergeometric (HG) distribution, specifically in the third element of params, which is the parameter beta in the HG distribution (see equation 2 in this paper for the definition of the density of the Gauss hypergeometric distribution). This parameter has to be positive for HG to have a valid density. Otherwise, the density won't integrate to 1, which is exactly what is happening in your example. With a negative beta, the distribution is not a valid probability distribution.
You can also find the requirement that beta (denoted as b) has to be positive in the scipy documentation here.
Changing beta to a positive parameter immediately solves your problem:
from scipy import stats
distribution = stats.gausshyper
params = [9.482986347673158, 16.65813644507513, 38.11083665959626, 16.08698932118982, -13.387170754433273, 18.352117022674125]
test_val = [-0.512720,1,1]
arg = params[:-2]
loc = params[-2]
scale = params[-1]
print("cdf:",distribution.cdf(test_val,*arg, loc=loc,scale=scale))
print("pdf:",distribution.pdf(test_val,*arg, loc=loc,scale=scale))
Output:
cdf: [1. 1. 1.]
pdf: [3.83898392e-32 1.25685346e-35 1.25685346e-35]
where all CDF values are now at most 1, as desired. Also note that your x has to be between 0 and 1, as described in the scipy documentation here.
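As an extra sanity check (my addition, assuming a SciPy version recent enough to provide the support() method), you can numerically integrate the PDF over its support; a valid parameter set should give a total mass close to 1.
from scipy import stats
from scipy.integrate import quad

params = [9.482986347673158, 16.65813644507513, 38.11083665959626, 16.08698932118982,
          -13.387170754433273, 18.352117022674125]
shape_args, loc, scale = params[:-2], params[-2], params[-1]
lo, hi = stats.gausshyper.support(*shape_args, loc=loc, scale=scale)
mass, _ = quad(lambda x: stats.gausshyper.pdf(x, *shape_args, loc=loc, scale=scale), lo, hi)
print(mass)  # expected to be close to 1.0 for valid parameters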

Choosing a random number where the probability is random in Python

While I can find decent information on how to generate numbers based on probabilities for picking each number with numpy.random.choice, e.g.:
np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
which picks 0 with probability p =.1, 1 with p = 0, 2 with p = .3, 3 with p = .6 and 4 with p = 0.
What I would like to know is, what function will vary the probabilities? So for example, one time I might have the probability distribution above and the next maybe p=[0.25, 0.1, 0.18, 0.2, 0.27]. So I would like to generate probability distributions on the fly. Is there a Python library that does this?
What I am wanting to do is to generate arrays, each of length n with numbers from some probability distribution, such as above.
One good option is the Dirichlet distribution: samples from it lie on the probability simplex, so each sample is a valid probability vector for a categorical (multinomial) distribution.
Naturally there's a convenient numpy function for generating as many such random distributions as you'd like:
# 10 length-4 probability distributions:
np.random.dirichlet((1,1,1,3),size = 10)
And these would get fed to the p= argument in your np.random.choice call.
You can consult Wikipedia for more info about how the concentration parameters (the tuple) affect the sampled probability vectors.
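Putting the two pieces together, here is a small usage sketch (my addition; I use five Dirichlet parameters so the probability vectors match the five outcomes in the question):
import numpy as np

probs = np.random.dirichlet((1, 1, 1, 3, 1), size=10)  # 10 length-5 probability vectors
samples = [np.random.choice(5, size=3, p=p) for p in probs]  # 10 arrays of 3 draws each
print(samples)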
AFAIK there's no inbuilt way to do this. You can do roulette wheel selection which should accomplish what you want.
The basic idea is simple:
import random

def roulette(weights):
    total = sum(weights)
    mark = random.random() * total
    runner = 0
    for index, val in enumerate(weights):
        runner += val
        if runner >= mark:
            return index
You can read more at https://en.wikipedia.org/wiki/Fitness_proportionate_selection
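A quick usage sketch (my addition, reusing the roulette function defined above with the example weights from the question):
weights = [0.25, 0.1, 0.18, 0.2, 0.27]
draws = [roulette(weights) for _ in range(10)]  # 10 indices drawn according to the weights
print(draws)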

How to implement a KS-Test in Python

scipy.stats.kstest(rvs, cdf, N) can perform a KS test on a dataset rvs. It tests whether the dataset follows a probability distribution whose CDF is specified in the parameters of this method.
Consider now a dataset of N=4800 samples. I have performed a KDE on this data and, therefore, have an estimated PDF. This PDF looks an awful lot like a bimodal distribution. When plotting the estimated PDF and curve_fitting a bimodal distribution to it, these two plots are pretty much identical. The parameters of the fitted bimodal distribution are (scale1, mean1, stdv1, scale2, mean2, stdv2):
[0.6, 0.036, 0.52, 0.23, 1.25, 0.4]
How can I apply scipy.stats.kstest to test if my estimated PDF is bimodal distributed?
As my null hypothesis, I state that the estimated PDF equals the following PDF:
hypoDist = 0.6*norm(loc=0, scale=0.2).pdf(x_grid) + 0.3*norm(loc=1, scale=0.2).pdf(x_grid)
hypoCdf = np.cumsum(hypoDist)/len(x_grid)
x_grid is just a vector that contains the x-values at which I evaluate my estimated PDF, so each entry of the PDF has a corresponding value of x_grid. It might be that my computation of hypoCdf is incorrect: maybe instead of dividing by len(x_grid), I should divide by np.sum(hypoDist)?
Challenge: the cdf parameter of kstest cannot be specified as "bimodal". Nor can I specify it to be hypoDist.
If I wanted to test whether my dataset was Gaussian distributed, I would write:
KS_result = kstest(measurementError, norm(loc=mean(pdf), scale=np.std(pdf)).cdf)
print(KS_result)
measurementError is the dataset that I have performed the KDE on. This returns:
statistic=0.459, pvalue=0.0
To me, it is a little irritating that the p-value is 0.0.
The cdf argument to kstest can be a callable that implements the cumulative distribution function of the distribution against which you want to test your data. To use it, you have to implement the CDF of your bimodal distribution. You want the distribution to be a mixture of two normal distributions. You can implement the CDF for this distribution by computing the weighted sum of the CDFs of the two normal distributions that make up the mixture.
Here's a script that shows how you can do this. To demonstrate how kstest is used, the script runs kstest twice. First it uses a sample that is not from the distribution. As expected, kstest computes a very small p-value for this first sample. It then generates a sample that is drawn from the mixture. For this sample, the p-value is not small.
import numpy as np
from scipy import stats

def bimodal_cdf(x, weight1, mean1, stdv1, mean2, stdv2):
    """
    CDF of a mixture of two normal distributions.
    """
    return (weight1*stats.norm.cdf(x, mean1, stdv1) +
            (1 - weight1)*stats.norm.cdf(x, mean2, stdv2))

# We only need weight1, since weight2 = 1 - weight1.
weight1 = 0.6
mean1 = 0.036
stdv1 = 0.52
mean2 = 1.25
stdv2 = 0.4

n = 200

# Create a sample from a regular normal distribution that has parameters
# similar to the bimodal distribution.
sample1 = stats.norm.rvs(0.5*(mean1 + mean2), 0.5, size=n)

# The result of kstest should show that sample1 is not from the bimodal
# distribution (i.e. the p-value should be very small).
stat1, pvalue1 = stats.kstest(sample1, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv1, mean2, stdv2))
print("sample1 p-value =", pvalue1)

# Create a sample from the bimodal distribution. This sample is the
# concatenation of samples from the two normal distributions that make
# up the bimodal distribution. The number of samples to take from the
# first distribution is determined by a binomial distribution of n
# samples with probability weight1.
n1 = np.random.binomial(n, p=weight1)
sample2 = np.concatenate((stats.norm.rvs(mean1, stdv1, size=n1),
                          stats.norm.rvs(mean2, stdv2, size=n - n1)))

# Most of the time, the p-value returned by kstest with sample2 will not
# be small. We expect the value to be uniformly distributed in the interval
# [0, 1], so in general it will not be very small.
stat2, pvalue2 = stats.kstest(sample2, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv1, mean2, stdv2))
print("sample2 p-value =", pvalue2)
Typical output (the numbers will be different each time the script is run):
sample1 p-value = 2.8395166853884146e-11
sample2 p-value = 0.3289374831186403
You might find that, for your problem, this test does not work well. You have 4800 samples, but in your code you have parameters whose numerical values have just one or two significant digits. Unless you have good reason to believe that your sample is drawn from a distribution with exactly those parameters, it is likely that kstest will return a very small p-value.

Program error and questions regarding max. log-likelihood

I'm trying to find the maximum likelihood estimate (MLE) for a probability density function (PDF) of the form d/a**2 * exp(-d/a) / scale, as implemented in the code below.
I'm computing it by minimising the objective function (the negative log-likelihood), without relying on any predefined log-likelihood Python modules whatsoever. The code is:
# Alpha Distribution (PDF)
def AD(z, *params):
    a, scale = z
    diameters = params
    return -np.sum(np.log((((diameters)/(a**2) * np.exp(-diameters/a))) / scale))
# load data
currpath = ('path')
os.chdir(currpath)
diameters = scipy.io.loadmat('data.mat')["m1"]
# minimise
x0 = [1,1] # initial guesses
res = optimize.minimize(AD, x0, args=diameters, method='Nelder-Mead',
                        tol=1e-6)
print(res.x)
My data vector (here already sorted) comprises a number of diameters in the following form (0.19, 0.19, 0.19, 0.2, 0.21, 0.21, 0.22, 0.22, 0.22, 0.25, 0.27 ...).
First question: Since I'm fairly new to the topic of MLE, is the form of my data vector correct? I'm not completely sure whether I use a data vector containing every observed diameter (like shown above), or a data vector which only contains the "possible" diameters (which would be: 0.19, 0.2, 0.21, 0.22, 0.25, 0.27 ...), or just the frequencies of the observed diameters (which would be: 3, 1, 2, 3, 1, 1 ...). I think the first option is the right one, but I just wanted to be completely sure.
Second question: If I wish to use a cumulative distribution function (CDF) instead of a PDF to perform my MLE on, I would have to change my PDF function to a CDF, right? I was just wondering if I could alternatively somehow modify my data vector and still use the PDF.
However, for the minimisation in Python (if I understood it correctly) I had to rethink the definition of my variables. Normally I would assume that the parameters of my PDF (here "a" and "scale") are the variables that should be passed to "args" in "optimize.minimize". However, the documentation states that args should contain the "constant" parameters, so I used my data vector as a constant "parameter vector" for the minimisation.
Third question: Is this assumption an error in reasoning?
Fourth question: Is the optimisation method "Nelder-Mead" appropriate? I'm not really familiar with optimisation methods and not sure which of the options I should use/is the best.
Finally, the program returns the error "TypeError: bad operand type for unary -: 'tuple'", and I have no clue how to deal with it, since I'm not passing any tuples to the minimisation function ...
Fifth question: Where does the tuple come from and how can I solve this error?
I'd appreciate any help you could give me very much!
Best regards!
PS: Since this post is kind of a mixture between general math and programming, I wasn't completely sure if this is the right place to put the question. Sorry if I'm mistaken!
First, apart from the first part (before the multiplication operator), we are discussing what is generally called maximum likelihood estimation (MLE) for the exponential distribution. It has just been reparameterised in terms of something called a.
We want to estimate this single parameter based on a sample of diameters; there is no scale parameter. Under MLE, we pretend that the sample is fixed and treat the parameter as something that can be varied. We form the likelihood of the sample by taking the product of the density functions (not the cdfs) where each density function is to be calculated for one element of the sample.
(Likelihood is, in concept, like throwing a die twice. In ultra ugly terms, we could say that the likelihood of getting two ones in a row might be (1/6)(1/6).)
We want to maximise this likelihood. However, to make the optimisation problem mathematically and/or computationally tractable, we take the function's logarithm. Since all of the constituent density values here are less than one, this log-likelihood is everywhere less than zero. Thus, the maximisation problem becomes one of minimisation.
If you want to avoid almost all of the algebra then you would:
Write a function to calculate the density function for a given diameter and parameter value.
Write another function that accepts a density-function parameter value as its first Python parameter and the sample as its second. Make it call the first function once for each sample value, take the log of each result, and return the sum of those logs.
Call minimize with the second function as its first argument, some reasonable guess for the density-function parameter (in a list) as the second argument, and the sample for args. Nelder-Mead is probably ok.
Edit: In a nutshell:
from scipy.optimize import minimize
from math import exp, log

diameters = [0.19, 0.19, 0.19, 0.2, 0.21, 0.21, 0.22, 0.22, 0.22, 0.25, 0.27]

def pdf(d, a):
    result = d*exp(-d/a)/a**2
    return result

def log_L(a, diameters):
    result = sum(log(pdf(d, a)) for d in diameters)
    return result

res = minimize(log_L, [1], args=diameters)
print(res)
Output:
fun: -337.80985348524604
hess_inv: array([[ 8.71770021e+10]])
jac: array([ -7.62939453e-06])
message: 'Optimization terminated successfully.'
nfev: 93
nit: 30
njev: 31
status: 0
success: True
x: array([ 2157576.39996697])
Addendum:
The Wikipedia article gives the pdf of the exponential distribution as f(x; lambda) = lambda * exp(-lambda * x) for x >= 0.
The constant lambda can be viewed as the value that scales the integral of the remainder of the expression, over zero to infinity, to one. We can ignore it and equate the exponents of your pdf (without its scaling factor) and of the exponential, remembering that d takes the role of x: -d/a = -lambda*d.
Solving for lambda gives lambda = 1/a.
Up to another factor of 1/a, this is the normalising expression 1/a**2 in your pdf. In other words, the alpha pdf is just an exponential-type expression with different parameters.
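A quick numerical check of that normalisation (my addition, not part of the original answer): the density d*exp(-d/a)/a**2 integrates to 1 over [0, inf) for any a > 0.
from math import exp
from scipy.integrate import quad

a = 0.4  # arbitrary positive value
mass, _ = quad(lambda d: d * exp(-d / a) / a**2, 0, float('inf'))
print(mass)  # ~1.0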
Here is another approach, assuming that you're analysing data and not simply working out the details of MLE.
scipy provides the means for generating samples from arbitrary distributions. Here I define just the pdf for your alpha. Your parameter a becomes p, because a is the rv_continuous argument used for the lower limit of the distribution support, which I define to be zero.
I draw a sample of size 100 with p set somewhat arbitrarily to 0.4. I did a little experimentation, trying to find a value that would give me a sample whose lowest 11 values would approximate those in your sample.
The scipy rv_continuous object has a method called fit that will attempt calculation of MLE estimates of location, scale and 'shape'. In this case, the value for shape, about 0.36, is not all that far from 0.4.
from scipy.stats import rv_continuous
import numpy as np

class Alpha(rv_continuous):
    'alpha distribution'
    def _pdf(self, x, p):
        return x*np.exp(-x/p)/p**2

alpha = Alpha(a=0, shapes='p')
sample = sorted(alpha.rvs(size=100, p=0.4))

for a in sample[:12]:
    print('{:10.2f}'.format(a))

print(Alpha(a=0, shapes='p').fit(sample))
I don't believe that your sample is alpha-distributed. The values seem to be too 'uniform' compared with what I could generate. But I've been wrong before.
I would suggest plotting your sample cdf to see if you can recognise what it is (a minimal sketch follows the output below).
Incidentally, when I changed the sign of the log-likelihood in the other answer the code croaked. I suspect that the alpha is just a poor fit.
0.00
0.03
0.04
0.04
0.08
0.09
0.09
0.11
0.12
0.14
0.19
0.20
(1.0902616847853124, -0.039102949269294023, 0.35922022997329517)
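Regarding the suggestion above to plot the sample CDF, here is a minimal empirical-CDF sketch (my addition, assuming matplotlib is available), using the eleven diameters quoted in the question:
import numpy as np
import matplotlib.pyplot as plt

diameters = np.array([0.19, 0.19, 0.19, 0.2, 0.21, 0.21, 0.22, 0.22, 0.22, 0.25, 0.27])
x = np.sort(diameters)
y = np.arange(1, len(x) + 1) / len(x)  # empirical CDF values
plt.step(x, y, where='post')
plt.xlabel('diameter')
plt.ylabel('empirical CDF')
plt.show()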

Multivariate normal CDF in Python

I am looking for a function to compute the CDF for a multivariate normal distribution. I have found that scipy.stats.multivariate_normal only has a method to compute the PDF (for a sample x), multivariate_normal.pdf(x, mean=mean, cov=cov), but not the CDF.
I am looking for the same thing but to compute the cdf, something like: multivariate_normal.cdf(x, mean=mean, cov=cov), but unfortunately multivariate_normal doesn't have a cdf method.
The only thing that I found is this: Multivariate Normal CDF in Python using scipy
but the presented method scipy.stats.mvn.mvnun(lower, upper, means, covar) doesn't take a sample x as a parameter, so I don't really see how to use it to have something similar to what I said above.
This is just a clarification of the points that @sascha made in the comments above. The relevant function is mvnormcdf from statsmodels.sandbox.distributions.extras:
As an example, for a multivariate normal distribution centred at the origin with diagonal covariance, the CDF evaluated at the mean should give (1/4) * total area = 0.25, since by symmetry each quadrant carries a quarter of the probability mass. The following example will let you play with it:
import numpy as np
from statsmodels.sandbox.distributions.extras import mvnormcdf

upper = np.array([0.0, 0.0])  # evaluate the CDF at the centre of the distribution
for i in range(1, 20, 2):
    cov_example = np.array(((i, 0), (0, i)))
    mean_example = np.array((0, 0))
    print(mvnormcdf(upper=upper, mu=mean_example, cov=cov_example))
The output of this is 0.25, 0.25, 0.25, 0.25...
The CDF of some distribution is actually an integral over the PDF of that distribution. That being so, you need to provide the function with the boundaries of the integral.
What most people mean when they ask for a p_value of some point in relation to some distribution is:
what is the chance of getting these values or higher given this distribution?
Note that this is not a point value, but rather an integral from some point onwards:
Accordingly, you need to set your point as the lower boundary, +inf (or some arbitrarily high enough value) as the upper boundary and provide the means and covariance matrix you already have:
import numpy as np
import scipy.stats
from sys import maxsize

def mvn_p_value(x, mu, cov_matrix):
    upper_bounds = np.array([maxsize] * x.size)  # upper bound vector matching the size of x
    # mvnun returns (value, inform); the first element is the integral of the
    # density from x up to the upper bounds.
    p_value = scipy.stats.mvn.mvnun(x, upper_bounds, mu, cov_matrix)[0]
    if 0.5 < p_value:  # this inversion is used for two-sided statistical testing
        p_value = 1 - p_value
    return p_value
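As an aside (my addition, not part of the answers above): recent SciPy versions expose a cdf method on multivariate_normal directly, so if upgrading SciPy is an option the lower-tail probability can be computed without mvnun. A small sketch:
import numpy as np
from scipy.stats import multivariate_normal

mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.0], [0.0, 1.0]])
x = np.array([0.0, 0.0])

# P(X1 <= 0, X2 <= 0) for a standard bivariate normal; roughly 0.25 by symmetry.
print(multivariate_normal.cdf(x, mean=mean, cov=cov))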
