Python - Statistical distribution

I'm quite new to the Python world, and I'm not a statistician. I need to implement mathematical models developed by mathematicians in a programming language, and I've chosen Python after some research. I'm comfortable with programming as such (PHP/HTML/JavaScript).
I have a column of values that I've extracted from a MySQL database and I need to calculate the following:
1) Normal distribution of it. (I don't have the sigma & mu values. These need to be calculated too apparently).
2) Mixture of normal distribution
3) Estimate density of normal distribution
4) Calculate 'Z' score
The array of values looks similar to the one below ( I've populated sample data)-
d1 = [3,3,3,3,3,3,3,9,12,6,3,3,3,3,9,21,3,12,3,6,3,30,12,6,3,3,24,30,3,3,3]
mu1, std1 = norm.fit(d1)
The normal distribution, I understand could be calculated as below -
import numpy as np
from scipy.stats import norm
mu, std = norm.fit(data)
Could I please get some pointers on how to get started with (2), (3) and (4)? I'm continuing to look online while I wait to hear from the experts.
If the question doesn't fully make sense, please do let me know what aspect is missing so that I'll try & get information around that.
I'd very much appreciate any help here please.

Some parts of your question are unclear. It might help to give the context of what you're trying to achieve, rather than only the specific steps you're taking.
1) + 3) In a Normal distribution - fitting the distribution, and estimating the mean and standard deviation - are basically the same thing. The mean and standard deviation completely determine the distribution.
mu, std = norm.fit(data)
is tantamount to saying "find the mean and standard deviation which best fit the distribution".
4) Calculating the Z score - you'll have to explain what you're trying to do. This usually means how much above (or below) the mean a data point is, in units of standard deviation. Is this what you need here? If so, then it is simply
(np.array(data) - mu) / std
2) Mixture of normal distribution - this is completely unclear. It usually means that the distribution is actually generated by more than a single Normal distribution. What do you mean by this?
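As a rough illustration of points 1), 3) and 4), here is a minimal sketch that uses only the standard library on the sample data from the question. It uses the population standard deviation (statistics.pstdev) because that is the maximum-likelihood estimate, which is also what norm.fit returns. The variable names and the points at which the density is evaluated are illustrative choices.

```python
from statistics import NormalDist, fmean, pstdev

d1 = [3, 3, 3, 3, 3, 3, 3, 9, 12, 6, 3, 3, 3, 3, 9, 21, 3, 12, 3, 6, 3,
      30, 12, 6, 3, 3, 24, 30, 3, 3, 3]

# Maximum-likelihood estimates of the normal parameters.
mu = fmean(d1)
std = pstdev(d1, mu)

# Z score: distance of each point from the mean, in units of standard deviation.
z_scores = [(x - mu) / std for x in d1]

# Density of the fitted normal at a few arbitrary points.
fitted = NormalDist(mu, std)
densities = {x: fitted.pdf(x) for x in (3, 9, 30)}

print(f"mu={mu:.3f}, std={std:.3f}")
print([round(z, 2) for z in z_scores[:8]])
print(densities)
```

By construction the Z scores have mean 0 and standard deviation 1, which is a quick sanity check on the calculation.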

About (2), a web search for "mixture of Gaussians Python" should turn up a lot of hits.
The mixture of Gaussians is a pretty simple idea -- instead of a single Gaussian bump, the density contains multiple bumps. The density is a weighted sum $\sum_k \alpha_k g(x, \mu_k, \sigma_k^2)$ where the weights $\alpha_k$ are positive and sum to 1, and $g(x, \mu, \sigma^2)$ is a single Gaussian bump.
To determine the parameters $\alpha_k$, $\mu_k$, and $\sigma_k^2$, typically one uses the so-called expectation-maximization (EM) algorithm. Again a web search should find many hits. The EM algorithm for a Gaussian mixture is implemented in some Python libraries. It is not too complicated to write it yourself, but maybe to get started you can use an existing implementation.
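To make the EM idea concrete, here is a minimal sketch of EM for a one-dimensional mixture, written from scratch with the standard library. The function name fit_gmm_1d, the quantile-based initialisation, the fixed iteration count and the synthetic data are all assumptions made for illustration; an existing library implementation would normally be more robust.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density g(x, mu, sigma^2) of a single Gaussian bump."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, k=2, n_iter=200, min_var=1e-6):
    """Fit a k-component 1-D Gaussian mixture with plain EM."""
    s = sorted(data)
    n = len(s)
    # Crude initialisation: spread the means over quantiles of the data.
    mus = [s[int((j + 0.5) * n / k)] for j in range(k)]
    mean = sum(s) / n
    overall_var = sum((x - mean) ** 2 for x in s) / n
    variances = [max(overall_var, min_var)] * k
    alphas = [1.0 / k] * k
    for _ in range(n_iter):
        # E step: responsibility of each component for each point.
        resp = []
        for x in data:
            w = [alphas[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(w) or 1e-300  # guard against underflow to zero
            resp.append([wj / total for wj in w])
        # M step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # component has collapsed; leave it unchanged
            alphas[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, min_var)
    return alphas, mus, variances


# Synthetic two-bump data, for illustration only.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(300)]
data += [random.gauss(8.0, 1.5) for _ in range(200)]

alphas, mus, variances = fit_gmm_1d(data, k=2)
print("weights:", alphas)
print("means:", mus)
print("variances:", variances)
```

EM only finds a local optimum, so results can depend on the initialisation, and the number of components k has to be chosen by the user.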

Related

Sampling from gaussian distribution

My question is very specific. Given a k dimensional Gaussian distribution with mean and standard deviation, say I wish to sample 10 points from this distribution. But the 10 samples should be very different from each other. For example, I do not wish to sample 5 of those very close to the mean (By very close, we may assume for this example within 1 sigma) which may happen if I do random sampling. Let us also add an additional constraint that all the drawn samples should be at least 1 sigma away from each other. Is there a known way to sample in this fashion methodically? Is there any such module in PyTorch which can do so?
Sorry if this thought is ill posed but I am trying to understand if such a thing is possible.

To my knowledge there is no such library. The problem you are trying to solve is straightforward, though: just check whether the random number you get is 'far enough' from the mean, and draw again if it is not. The complexity of that check is constant. In one dimension, the probability that a point lies more than one sigma from the mean is about 32%, so it is not that unlikely.
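The same accept-or-redraw idea can be extended to the pairwise constraint from the question. The sketch below is a plain rejection sampler using the standard library rather than PyTorch. The function name sample_separated, the isotropic covariance and the max_tries cutoff are assumptions made for illustration.

```python
import math
import random


def sample_separated(mean, sigma, n_points, min_dist, max_tries=100_000, rng=random):
    """Draw points from an isotropic Gaussian, keeping a candidate only if it
    is at least min_dist (Euclidean) away from every point accepted so far."""
    accepted = []
    for _ in range(max_tries):
        cand = [rng.gauss(m, sigma) for m in mean]
        if all(math.dist(cand, p) >= min_dist for p in accepted):
            accepted.append(cand)
            if len(accepted) == n_points:
                return accepted
    raise RuntimeError("could not place all points; try a smaller min_dist")


random.seed(1)
points = sample_separated(mean=[0.0, 0.0, 0.0], sigma=1.0, n_points=10, min_dist=1.0)
for p in points:
    print([round(c, 2) for c in p])
```

Note that the accepted points are no longer independent draws from the Gaussian: the separation rule pushes them apart, so the set is more spread out than a plain random sample would be.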

Find the appropriate polynomial fit for data in Python

Is there a function or library in Python to automatically compute the best polynomial fit for a set of data points? I am not really interested in the ML use case of generalizing to a set of new data, I am just focusing on the data I have. I realize the higher the degree, the better the fit. However, I want something that penalizes or looks at where the error elbows? When I say elbowing, I mean something like this (although usually it is not so drastic or obvious):
One idea I had was to use Numpy's polyfit: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.polyfit.html to compute polynomial regression for a range of orders/degrees. Polyfit requires the user to specify the degree of polynomial, which poses a challenge because I don't have any assumptions or preconceived notions. The higher the degree of fit, the lower the error will be but eventually it plateaus like the image above. Therefore if I want to automatically compute the degree of polynomial where the error curve elbows: if my error is E and d is my degree, I want to maximize (E[d-1] - E[d]) - (E[d] - E[d+1]), that is, the drop in error from reaching degree d minus the drop from going one degree further.
Is this even a valid approach? Are there other tools and approaches in well-established Python libraries like Numpy or Scipy that can help with finding the appropriate polynomial fit (without me having to specify the order/degree)? I would appreciate any thoughts or suggestions! Thanks!

To select the "right" fit and prevent over-fitting, you can use the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). Both penalize the number of fitted parameters, and the fit with the lowest score is preferred. Note that your fitting procedure can be non-Bayesian and you can still use these to compare fits.
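A minimal sketch of how this could look with numpy.polyfit, assuming Gaussian errors so that both criteria can be written in terms of the residual sum of squares (up to a constant shared by all degrees). The helper name information_criteria and the synthetic cubic data are assumptions for illustration.

```python
import numpy as np


def information_criteria(x, y, max_degree):
    """Return {degree: (aic, bic)} for least-squares polynomial fits."""
    n = len(x)
    scores = {}
    for degree in range(max_degree + 1):
        coeffs = np.polyfit(x, y, degree)
        rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
        k = degree + 1  # number of fitted coefficients
        # Gaussian-error forms, up to a constant shared by all degrees.
        aic = n * np.log(rss / n) + 2 * k
        bic = n * np.log(rss / n) + k * np.log(n)
        scores[degree] = (aic, bic)
    return scores


# Synthetic data, for illustration only: a cubic plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = x**3 - 2 * x + rng.normal(scale=0.5, size=x.size)

scores = information_criteria(x, y, max_degree=8)
best_aic = min(scores, key=lambda d: scores[d][0])
best_bic = min(scores, key=lambda d: scores[d][1])
print("degree chosen by AIC:", best_aic)
print("degree chosen by BIC:", best_bic)
```

BIC penalizes extra coefficients more heavily than AIC once there are more than about 8 data points, so it tends to pick the same or a lower degree.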

alpha parameter from an alpha-stable distribution

Suppose I have a collection of data, say of length 100. My hypothesis is that these data follow an alpha-stable distribution. Is there a way to calculate the alpha parameter from these data?
I would like to do that in Python specifically. All I found was this package:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levy_stable.html#scipy.stats.levy_stable
which just calculates an alpha-stable distribution given the parameters of the distribution.
I am not that familiar with alpha-stable distributions, so I will try to make it clearer using the example of the Poisson distribution. If I have some data that I know follow a Poisson distribution, isn't it possible to calculate the λ of that distribution? (Is this possible or am I missing something from statistics theory?)

If I have some data that I know follow a Poisson distribution, isn't it possible to calculate the λ of that distribution? (Is this possible or am I missing something from statistics theory?)
Sure. The mean of a Poisson distribution is equal to λ, so compute the mean of your data and use it as the estimate. Because the variance is equal to λ as well, there is a quick check of how Poisson your data are: compute the variance too and compare it to the mean/λ. If they are comparable, you're on a good track, though some Monte Carlo sampling at the end might help as well.
With respect to the alpha-stable distribution, I would start by computing the skewness, mean, median and mode of the data. If the data have little or no skew and the mean, median and mode are close, then one can assume that beta is 0 and mu is known. You then have only two parameters left to determine (alpha and c), and building the PDF as the Fourier transform of the characteristic function and fitting it to the data might work.
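The Poisson check described above takes only a few lines. This is a minimal sketch with made-up count data; the name dispersion for the variance-to-mean ratio is an illustrative choice.

```python
from statistics import fmean, pvariance

# Hypothetical count data, for illustration only.
counts = [2, 4, 3, 5, 1, 3, 2, 4, 6, 3, 2, 3, 4, 1, 5, 3, 2, 4, 3, 3]

lam = fmean(counts)            # estimate of lambda: the sample mean
var = pvariance(counts, lam)   # for Poisson data this should be close to lam
dispersion = var / lam         # roughly 1 if the data look Poisson

print(f"lambda estimate: {lam:.3f}")
print(f"variance:        {var:.3f}")
print(f"variance / mean: {dispersion:.3f}")
```

A ratio well above 1 suggests overdispersion and a ratio well below 1 suggests underdispersion; with small samples the ratio is noisy, so treat it as a rough indication only.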

random variable from skewed distribution with scipy

I'm trying to draw a random number from a distribution in SciPy, just like you would with stats.norm.rvs. However, I'm trying to take the number from an empirical distribution I have - it's a skewed dataset and I want to incorporate the skew and kurtosis into the distribution that I'm drawing from. Ideally I'd like to just call stats.norm.rvs(loc=blah,scale=blah,size=blah) and then also set the skew and kurt in addition to the mean and variance. The norm function takes a 'moments' argument consisting of some arrangement of 'mvsk' where the s and k stand for skew and kurtosis, but apparently all that does is ask that the s and k be computed from the rv, whereas I want to establish the s and k as parameters of the distribution to begin with.
Anyway, I'm not a statistics expert by any means, so perhaps this is a simple or misguided question. I would appreciate any help.
EDIT: If the four moments aren't enough to define the distribution well enough, is there any other way to draw values that are consistent with an empirical distribution that looks like this: http://i.imgur.com/3yB2Y.png

If you are not worried about getting out into the tails of the distribution, and the data are floating point, then you can sample from the empirical distribution.
Sort the data.
Prepend a 0 to the data.
Let N denote the length of this data_array.
Compute q = numpy.random.rand() * (N - 1)
idx = int(q); di = q - idx
xlo = data_array[idx]; xhi = data_array[idx + 1]
return xlo + (xhi - xlo) * di
Basically, this is linearly interpolating in the empirical CDF to obtain
the random variates.
The two potential problems are (1) if your data set is small, you may not represent the
distribution well, and (2) you will not generate a value larger than the largest
one in your existing data set.
To get beyond those you need to look at parametric distributions, like the gamma distribution mentioned above.
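The interpolation steps above can be sketched as a small function using only the standard library. The name sample_empirical and the synthetic input data are assumptions for illustration. Scaling the uniform draw by N - 1 rather than N keeps idx + 1 inside the array.

```python
import random


def sample_empirical(data, size, rng=random):
    """Draw `size` variates by linear interpolation between sorted data points."""
    # Prepending 0 assumes the data are non-negative with a lower bound of 0.
    data_array = [0.0] + sorted(data)
    n = len(data_array)
    out = []
    for _ in range(size):
        q = rng.random() * (n - 1)  # position in [0, n - 1)
        idx = int(q)                # index of the lower neighbour
        di = q - idx                # fractional distance to the upper neighbour
        xlo, xhi = data_array[idx], data_array[idx + 1]
        out.append(xlo + (xhi - xlo) * di)
    return out


# Skewed, positive sample data, for illustration only.
random.seed(0)
data = [random.expovariate(1.0) for _ in range(200)]

draws = sample_empirical(data, size=1000)
print(min(draws), max(draws))
```

As noted above, no draw can exceed the largest observed value, so the tail of the underlying distribution is cut off.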

The normal distribution has only 2 parameters, mean and variance. There are extensions of the normal distribution that have 4 parameters, adding skew and kurtosis. One example would be the Gram-Charlier expansion, but as far as I remember only the pdf is available in scipy, not the rvs.
As an alternative, there are distributions in scipy.stats that have 4 parameters, like johnsonsu, which are flexible but have a different parameterization.
However, in your example, the distribution is for values larger than zero, so an approximately normal distribution wouldn't work very well. As Andrew suggested, I think you should look through the distributions in scipy.stats that have a lower bound of zero, like the gamma, and you might find something close.
Another alternative, if your sample is large enough, would be to use gaussian_kde, which can also create random numbers. But gaussian_kde is also not designed for distribution with a finite bound.

Maybe I've misunderstood, I'm certainly not a stats expert, but your image looks quite a bit like a gamma distribution.
SciPy contains code specifically for gamma distributions - http://www.scipy.org/doc/api_docs/SciPy.stats.distributions.html#gamma
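If a gamma distribution does look plausible, one quick way to try it is a method-of-moments fit followed by sampling, sketched below with the standard library only. This is not SciPy's fitting routine, which uses maximum likelihood. The function name fit_gamma_moments and the synthetic data are assumptions for illustration.

```python
import random
from statistics import fmean, pvariance


def fit_gamma_moments(data):
    """Method-of-moments estimates: mean = shape * scale, variance = shape * scale**2."""
    m = fmean(data)
    v = pvariance(data, m)
    shape = m * m / v
    scale = v / m
    return shape, scale


# Synthetic skewed data standing in for the empirical sample.
random.seed(0)
data = [random.gammavariate(2.0, 3.0) for _ in range(5000)]

shape, scale = fit_gamma_moments(data)
print(f"shape={shape:.3f}, scale={scale:.3f}")

# Draw new variates from the fitted distribution.
new_draws = [random.gammavariate(shape, scale) for _ in range(10)]
print([round(v, 2) for v in new_draws])
```

The fitted gamma reproduces the mean and variance of the data exactly, while the skew and kurtosis are then fixed by the shape parameter, so it is worth comparing a histogram of the draws against the original data.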

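Short answer (replace with another distribution if needed):
from random import random  # uniform draws on [0, 1); swap in another generator if needed
n = 100
a_b = [random() for i in range(n)]
a_b.sort()
# len(a_b[:int(n*.8)])
c = a_b[int(n*.8)]  # value at the 80th percentile of the sorted sample
print(c)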

Fitting data to distributions?

I am not a statistician (more of a researchy web developer) but I've been hearing a lot about scipy and R these days. So out of curiosity I wanted to ask this question (though it might sound silly to the experts around here) because I am not sure of the advances in this area and want to know how people without a sound statistics background approach these problems.
Given a set of real numbers observed from an experiment, let us say they belong to one of the many distributions out there (like Weibull, Erlang, Cauchy, Exponential etc.), are there any automated ways of finding the right distribution and the distribution parameters for the data? Are there any good tutorials that walk me through the process?
Real-world Scenario:
For instance, let us say I initiated a small survey and recorded information about how many people a person talks to every day for say 300 people and I have the following information:
1 10
2 5
3 20
...
...
where X Y tells me that person X talked to Y people during the period of the survey. Now using the information from the 300 people, I want to fit this into a model. The question boils down to are there any automated ways of finding out the right distribution and distribution parameters for this data or if not, is there a good step-by-step procedure to achieve the same?

This is a complicated question, and there are no perfect answers. I'll try to give you an overview of the major concepts, and point you in the direction of some useful reading on the topic.
Assume that you have a one-dimensional set of data, and you have a finite set of probability distribution functions that you think the data may have been generated from. You can consider each distribution independently, and try to find parameters that are reasonable given your data.
There are two methods for setting parameters for a probability distribution function given data:
Least Squares
Maximum Likelihood
In my experience, Maximum Likelihood has been preferred in recent years, although this may not be the case in every field.
Here's a concrete example of how to estimate parameters in R. Consider a set of random points generated from a Gaussian distribution with mean of 0 and standard deviation of 1:
x = rnorm( n = 100, mean = 0, sd = 1 )
Assume that you know the data were generated using a Gaussian process, but you've forgotten (or never knew!) the parameters for the Gaussian. You'd like to use the data to give you reasonable estimates of the mean and standard deviation. In R, there is a standard library that makes this very straightforward:
library(MASS)
params = fitdistr( x, "normal" )
print( params )
This gave me the following output:
      mean           sd
  -0.17922360    1.01636446
 ( 0.10163645) ( 0.07186782)
Those are fairly close to the right answer, and the numbers in parentheses are the estimated standard errors of the parameters. Remember that every time you generate a new set of points, you'll get a new answer for the estimates.
Mathematically, this is using maximum likelihood to estimate both the mean and standard deviation of the Gaussian. Likelihood means (in this case) "probability of data given values of the parameters." Maximum likelihood means "the values of the parameters that maximize the probability of generating my input data." Maximum likelihood estimation is the algorithm for finding the values of the parameters which maximize the probability of generating the input data, and for some distributions it can involve numerical optimization algorithms. In R, most of the work is done by fitdistr, which in certain cases will call optim.
You can extract the log-likelihood from your parameters like this:
print( params$loglik )
[1] -139.5772
It's more common to work with the log-likelihood rather than the likelihood to avoid numerical underflow. Estimating the joint probability of your data involves multiplying the probabilities (or densities) of the individual points, which are typically less than 1. Even for a small set of data, the joint probability approaches 0 very quickly, and adding the log-probabilities of your data is equivalent to multiplying the probabilities. A larger log-likelihood means a better fit; here the values are negative, so more negative numbers are worse fits to your data.
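The underflow problem is easy to reproduce. The sketch below is in Python rather than R and uses only the standard library. It multiplies the densities of 2000 standard-normal points and compares that with summing their logs. The sample size is an arbitrary choice, large enough to make the product underflow.

```python
import math
import random
from statistics import NormalDist

random.seed(0)
x = [random.gauss(0, 1) for _ in range(2000)]
dist = NormalDist(0, 1)

# Direct product of densities: each factor is below 0.4, so the product
# underflows to exactly 0.0 in double precision.
likelihood = 1.0
for xi in x:
    likelihood *= dist.pdf(xi)

# Sum of log-densities: the same quantity on a log scale, and still finite.
log_likelihood = sum(math.log(dist.pdf(xi)) for xi in x)

print("likelihood:    ", likelihood)
print("log-likelihood:", log_likelihood)
```

Once the product has collapsed to zero, every candidate set of parameters looks equally bad, which is why optimizers work on the log scale.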
With computational tools like this, it's easy to estimate parameters for any distribution. Consider this example:
x = x[ x >= 0 ]
distributions = c("normal","exponential")
for ( dist in distributions ) {
    print( paste( "fitting parameters for ", dist ) )
    params = fitdistr( x, dist )
    print( params )
    print( summary( params ) )
    print( params$loglik )
}
The exponential distribution doesn't generate negative numbers, so I removed them in the first line. The output (which is stochastic) looked like this:
[1] "fitting parameters for normal"
      mean          sd
  0.72021836    0.54079027
 (0.07647929)  (0.05407903)
         Length Class  Mode
estimate 2      -none- numeric
sd       2      -none- numeric
n        1      -none- numeric
loglik   1      -none- numeric
[1] -40.21074
[1] "fitting parameters for exponential"
     rate
  1.388468
 (0.196359)
         Length Class  Mode
estimate 1      -none- numeric
sd       1      -none- numeric
n        1      -none- numeric
loglik   1      -none- numeric
[1] -33.58996
The exponential distribution is actually slightly more likely to have generated this data than the normal distribution, likely because the exponential distribution doesn't have to assign any probability density to negative numbers.
All of these estimation problems get worse when you try to fit your data to more distributions. Distributions with more parameters are more flexible, so they'll fit your data better than distributions with fewer parameters. Also, some distributions are special cases of other distributions (for example, the Exponential is a special case of the Gamma). Because of this, it's very common to use prior knowledge to constrain your choice of models to a subset of all possible models.
One trick to get around some problems in parameter estimation is to generate a lot of data, and leave some of the data out for cross-validation. To cross-validate your fit of parameters to data, leave some of the data out of your estimation procedure, and then measure each model's likelihood on the left-out data.
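The cross-validation idea can be sketched in Python with the standard library: fit each candidate on a training split by maximum likelihood, then score it by log-likelihood on the held-out split. The 80/20 split, the two candidate distributions and the synthetic exponential data are assumptions for illustration.

```python
import math
import random
from statistics import fmean, pstdev


def normal_loglik(data, mu, sd):
    """Log-likelihood of data under a normal with the given parameters."""
    return sum(-0.5 * math.log(2 * math.pi * sd * sd)
               - (x - mu) ** 2 / (2 * sd * sd) for x in data)


def exponential_loglik(data, rate):
    """Log-likelihood of non-negative data under an exponential distribution."""
    return sum(math.log(rate) - rate * x for x in data)


# Synthetic data, for illustration only.
random.seed(0)
data = [random.expovariate(1.0) for _ in range(2000)]
random.shuffle(data)
train, held_out = data[:1600], data[1600:]

# Maximum-likelihood fits on the training split only.
mu, sd = fmean(train), pstdev(train)
rate = 1.0 / fmean(train)

# Score each fitted model on the data it has not seen.
scores = {
    "normal": normal_loglik(held_out, mu, sd),
    "exponential": exponential_loglik(held_out, rate),
}
best = max(scores, key=scores.get)
print(scores)
print("best on held-out data:", best)
```

Because the score is computed on data that was not used for fitting, a more flexible model gets no automatic advantage from having extra parameters.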

Take a look at fitdistrplus (http://cran.r-project.org/web/packages/fitdistrplus/index.html).
A couple of quick things to note:
Try the function descdist, which provides a plot of skew vs. kurtosis of the data and also shows some common distributions.
fitdist allows you to fit any distributions you can define in terms of density and cdf.
You can then use gofstat which computes the KS and AD stats which measure distance of the fit from the data.
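For readers working in Python, the KS distance mentioned above is simple to compute by hand. This is a minimal standard-library sketch, not the fitdistrplus implementation. The function name ks_statistic and the synthetic data are assumptions for illustration, and the Anderson-Darling statistic is not covered.

```python
import random
from statistics import NormalDist, fmean, pstdev


def ks_statistic(data, cdf):
    """Largest vertical gap between the empirical CDF of data and a model CDF."""
    s = sorted(data)
    n = len(s)
    d = 0.0
    for i, x in enumerate(s):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Synthetic data, for illustration only.
random.seed(0)
sample = [random.gauss(10.0, 2.0) for _ in range(300)]

fitted = NormalDist(fmean(sample), pstdev(sample))
shifted = NormalDist(fitted.mean + fitted.stdev, fitted.stdev)  # deliberately wrong model

d_fitted = ks_statistic(sample, fitted.cdf)
d_shifted = ks_statistic(sample, shifted.cdf)
print(f"KS distance, fitted normal:  {d_fitted:.4f}")
print(f"KS distance, shifted normal: {d_shifted:.4f}")
```

A smaller distance means the model CDF tracks the data more closely. When the parameters were estimated from the same data, the standard KS p-value tables no longer apply directly, so the distance is best used for comparing candidate fits.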

This is probably a bit more general than you need, but might give you something to go on.
One way to estimate a probability density function from random data is to use an Edgeworth or Gram-Charlier expansion. These approximations use density function properties known as cumulants (the unbiased estimators for which are the k-statistics) and express the density function as a perturbation from a Gaussian distribution.
These both have some rather dire weaknesses such as producing divergent density functions, or even density functions that are negative over some regions. However, some people find them useful for highly clustered data, or as starting points for further estimation, or for piecewise estimated density functions, or as part of a heuristic.
M. G. Kendall and A. Stuart, The advanced theory of statistics, vol. 1, Charles Griffin, 1963, was the most complete reference I found for this, with a whopping whole page dedicated to the topic; most other texts had a sentence on it at most or listed the expansion in terms of the moments instead of the cumulants which is a bit useless. Good luck finding a copy, though, I had to send my university librarian on a trip to the archives for it... but this was years ago, so maybe the internet will be more helpful today.
The most general form of your question is the topic of a field known as non-parametric density estimation, where given:
data from a random process with an unknown distribution, and
constraints on the underlying process
...you produce a density function that is the most likely to have produced the data. (More realistically, you create a method for computing an approximation to this function at any given point, which you can use for further work, eg. comparing the density functions from two sets of random data to see whether they could have come from the same process).
Personally, though, I have had little luck in using non-parametric density estimation for anything useful, but if you have a steady supply of sanity you should look into it.
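One of the simplest non-parametric estimators is a kernel density estimate, which places a small Gaussian bump on every data point and averages them. The sketch below uses only the standard library. The function name kde_1d, the normal-reference bandwidth rule and the synthetic data are assumptions for illustration; this is a different technique from the cumulant expansions discussed above.

```python
import math
import random
from statistics import pstdev


def kde_1d(data, bandwidth=None):
    """Return a function approximating the density of 1-D data."""
    n = len(data)
    if bandwidth is None:
        # Normal-reference rule of thumb; reasonable for roughly unimodal data.
        bandwidth = 1.06 * pstdev(data) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density


# Synthetic data, for illustration only.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(500)]
density = kde_1d(data)

# Numerical check that the estimate integrates to roughly 1.
step = 0.05
area = sum(density(-8.0 + i * step) for i in range(int(16.0 / step))) * step
print(f"density at 0: {density(0.0):.4f}")
print(f"area under the estimate: {area:.4f}")
```

Unlike the expansions above, this estimate is never negative, but it smooths over sharp features and behaves poorly near hard boundaries.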

I'm not a scientist, but if you were doing it with pencil and paper, the obvious way would be to make a graph, then compare the graph to one of a known standard distribution.
Going further with that thought, "comparing" is looking if the curves of a standard-distribution and yours are similar.
Trigonometry, tangents... would be my last thought.
I'm not an expert, just another humble Web Developer =)

You are essentially wanting to compare your real world data to a set of theoretical distributions. There is the function qqnorm() in base R, which will do this for the normal distribution, but I prefer the probplot function in e1071 which allows you to test other distributions. Here is a code snippet that will plot your real data against each one of the theoretical distributions that we paste into the list. We use plyr to go through the list, but there are several other ways to go through the list as well.
library("plyr")
library("e1071")
realData <- rnorm(1000) #Real data is normally distributed
distToTest <- list(qnorm = "qnorm", lognormal = "qlnorm", qexp = "qexp")
#function to test real data against list of distributions above. Output is a jpeg for each distribution.
testDist <- function(x, data){
    jpeg(paste(x, ".jpeg", sep = ""))
    probplot(data, qdist = x)
    dev.off()
}
l_ply(distToTest, function(x) testDist(x, realData))
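The same comparison can be done numerically in Python by pairing sorted data with theoretical quantiles and measuring how straight the resulting Q-Q line is. This is a minimal standard-library sketch. The function names, the plotting positions (i + 0.5) / n and the use of a correlation coefficient in place of a plot are assumptions for illustration.

```python
import math
import random
from statistics import NormalDist


def qq_points(data, inv_cdf):
    """Pair theoretical quantiles with the sorted data."""
    s = sorted(data)
    n = len(s)
    theoretical = [inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, s


def pearson(u, v):
    """Pearson correlation; close to 1 when the Q-Q points lie on a line."""
    mu_u, mu_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


std_normal = NormalDist()
candidates = {
    "normal": std_normal.inv_cdf,
    "lognormal": lambda p: math.exp(std_normal.inv_cdf(p)),
    "exponential": lambda p: -math.log(1.0 - p),
}

# Real data is normally distributed, as in the R snippet.
random.seed(0)
real_data = [random.gauss(0.0, 1.0) for _ in range(1000)]

straightness = {name: pearson(*qq_points(real_data, inv))
                for name, inv in candidates.items()}
best = max(straightness, key=straightness.get)
print(straightness)
print("straightest Q-Q line:", best)
```

Because correlation ignores shifts and rescaling, the standard form of each distribution is enough here; looking at the actual plot is still worthwhile, since it shows where in the range the fit breaks down.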

For what it's worth, it seems like you might want to look at the Poisson distribution.
