MCMC implementation in Python

MCMC implementation in Python - python

I have the following problem:
There are 12 samples around 20000 elements each from unknown distributions (sometimes the distributions are not uni-modal so it's hard to automatically estimate an analytical family of the distributions).
Based on these distributions I compute different quantities. How can I explore the distribution of the target quantity in the most efficient (and simplest) way?
To be absolutely clear, here's a simple example: quantity A is equal to B*C/D
B,C,D are distributed according to unknown laws but I have samples from their distributions and based on these samples I want to compute the distribution of A.
So in fact what I want is a tool to explore the distribution of the target quantity based on samples of the variables.
I know that there are MCMC algorithms to do that. But does anybody know a good implementation of an MCMC sampler in Python or C? Or are there any other ways to solve the problem?
Maxim

Have you taken a look to pymc? As it says in its description: "pymc is a python package that implements the Metropolis-Hastings algorithm as a python class, and is extremely flexible and applicable to a large suite of problems" So you can use Metropolis-Hastings for obtaining a sequence of random samples.

The simplest way to explore the distribution of A is to generate samples based on the samples of B, C, and D, using your rule. That is, for each iteration, draw one value of B, C, and D from their respective sample sets, independently, with repetition, and calculate A = B*C/D.
If the sample sets for B, C, and D have the same size, I recommend generating a sample for A of the same size. Much fewer samples would result in loss of information, much more samples would not gain much. And yes, even though many samples will not be drawn, I still recommend drawing with repetition.

Related

Sampling from gaussian distribution

My question is very specific. Given a k dimensional Gaussian distribution with mean and standard deviation, say I wish to sample 10 points from this distribution. But the 10 samples should be very different from each other. For example, I do not wish to sample 5 of those very close to the mean (By very close, we may assume for this example within 1 sigma) which may happen if I do random sampling. Let us also add an additional constraint that all the drawn samples should be at least 1 sigma away from each other. Is there a known way to sample in this fashion methodically? Is there any such module in PyTorch which can do so?
Sorry if this thought is ill posed but I am trying to understand if such a thing is possible.

To my knowledge there is no such library. The problem you are trying to solve is straightforward. Just check if the random number you get is 'far enough' from the mean. The complexity of that check is constant. The probability of a point not to be between one sigma from the mean is ~32%. It is not that unlikely.

Approaches for using statistics packages for maximum likelihood estimation for hundreds of covariates

I am trying to investigate the distribution of maximum likelihood estimates for specifically for a large number of covariates p and a high dimensional regime (meaning that p/n, with n the sample size, is about 1/5). I am generating the data and then using statsmodels.api.Logit to fit the parameters to my model.
The problem is, this only seems to work in a low dimensional regime (like 300 covariates and 40000 observations). Specifically, I get that the maximum number of iterations has been reached, the log likelihood is inf i.e. has diverged, and a 'singular matrix' error.
I am not sure how to remedy this. Initially, when I was still working with smaller values (say 80 covariates, 4000 observations), and I got this error occasionally, I set a maximum of 70 iterations rather than 35. This seemed to help.
However it clearly will not help now, because my log likelihood function is diverging. It is not just a matter of non-convergence within the maixmum number of iterations.
It would be easy to answer that these packages are simply not meant to handle such numbers, however there have been papers specifically investigating this high dimensional regime, say here where p=800 covariates and n=4000 observations are used.
Granted, this paper used R rather than python. Unfortunately I do not know R. However I should think that python optimisation should be of comparable 'quality'?
My questions:
Might it be the case that R is better suited to handle data in this high p/n regime than python statsmodels? If so, why and can the techniques of R be used to modify the python statsmodels code?
How could I modify my code to work for numbers around p=800 and n=4000?

In the code you currently use (from several other questions), you implicitly use the Newton-Raphson method. This is the default for the sm.Logit model. It computes and inverts the Hessian matrix to speed-up estimation, but that is incredibly costly for large matrices - let alone oft results in numerical instability when the matrix is near singular, as you have already witnessed. This is briefly explained on the relevant Wikipedia entry.
You can work around this by using a different solver, like e.g. the bfgs (or lbfgs), like so,
model = sm.Logit(y, X)
result = model.fit(method='bfgs')
This runs perfectly well for me even with n = 10000, p = 2000.
Aside from estimation, and more problematically, your code for generating samples results in data that suffer from a large degree of quasi-separability, in which case the whole MLE approach is questionable at best. You should urgently look into this, as it suggests your data may not be as well-behaved as you might like them to be. Quasi-separability is very well explained here.

Problems with computing the entropy of random variables via SciPy (stats)

Recently I've been trying to figure out how to calculate the entropy of a random variable X using
sp.stats.entropy()
from the stats package of SciPy, with this random variable X being the returns I obtain from the stock of a specific company ("Company 1") from 1997 to 2012 (this is for a financial data/machine learning assignment). However, the arguments involve inputting the probability values
pk
and so far I'm even struggling with computing the actual empirical probabilities, seeing as I only have the observations of the random variable. I've tried different ways of normalising the data in order to obtain an array of probabilities, but my data contains negative values too, which means that when I try and do
asset1/np.sum(asset1)
where asset1 is the row array of the returns of the stock of "Company 1", I manage to obtain a new array which adds up to 1, but obviously with some negative values, and as we all know, negative probabilities do not exist. Therefore, is there any way of computing the empirical probabilities of my observations occurring again (ideally with the option of choosing specific bins, or for a range of values) on Python?
Furthermore, I've been trying to look for a Python package for countless hours which is solely dedicated to the calculation of random variable entropies, joint entropies, mutual information etc. as an alternative to SciPy's entropy option (simply to compare) but most seem to be outdated (I currently have Python 3.5), hence does anyone know of any good package which is compatible with my current version of Python? I know R seems to have a very compact one.
Any kind of help would be highly appreciated. Thank you very much in advance!
EDIT: stock returns are considered to be RANDOM VARIABLES, as opposed to the stock prices which are processes. Therefore, the entropy can definitely be applied in this context.

For continuous distributions, you are better off using the Kozachenko-Leonenko k-nearest neighbour estimator for entropy (K & L 1987) and the corresponding Kraskov, ..., Grassberger (2004) estimator for mutual information. These circumvent the intermediate step of calculating the probability density function, and estimate the entropy directly from the distances of data point to their k-nearest neighbour.
The basic idea of the Kozachenko-Leonenko estimator is to look at (some function of) the average distance between neighbouring data points. The intuition is that if that distance is large, the dispersion in your data is large and hence the entropy is large. In practice, instead of taking the nearest neighbour distance, one tends to take the k-nearest neighbour distance, which tends to make the estimate more robust.
I have implementations for both on my github:
https://github.com/paulbrodersen/entropy_estimators
The code has only been tested using python 2.7, but I would be surprised if it doesn't run on 3.x.

Solve multi-objectives optimization of a graph in Python

I'm trying to find what seems to be a complicated and time-consuming multi-objective optimization on a large-ish graph.
Here's the problem: I want to find a graph of n vertices (n is constant at, say 100) and m edges (m can change) where a set of metrics are optimized:
Metric A needs to be as high as possible
Metric B needs to be as low as possible
Metric C needs to be as high as possible
Metric D needs to be as low as possible
My best guess is to go with GA. I am not very familiar with genetic algorithms, but I can spend a little time to learn the basics. From what I'm reading so far, I need to go as such:
Generate a population of graphs of n nodes randomly connected to
each other by m = random[1,2000] (for instance) edges
Run the metrics A, B, C, D on each graph
Is an optimal solution found (as defined in the problem)?
If yes, perfect. If not:
Select the best graphs
Crossover
Mutate (add or remove edges randomly?)
Go to 3.
Now, I usually use Python for my little experiments. Could DEAP (https://code.google.com/p/deap/) help me with this problem?
If so, I have many more questions (especially on the crossover and mutate steps), but in short: are the steps (in Python, using DEAP) easy enough to be explain or summarized here?
I can try and elaborate if needed. Cheers.

Disclaimer: I am one of DEAP lead developer.
Your individual could be represented by a binary string. Each bit would indicate whether there is an edge between two vertices. Therefore, your individuals would be composed of n * (n - 1) / 2 bits, where n is the number of vertices. To evaluate your individual, you would simply need to build an adjacency matrix from the individual genotype. For an evaluation function example, see the following gist https://gist.github.com/cmd-ntrf/7816665.
Your fitness would be composed of 4 objectives, and based on what you said regarding minimization and maximization of each objective, the fitness class would be created like this :
creator.create("Fitness", base.Fitness, weights=(1.0, -1.0, 1.0, -1.0)
The crossover and mutation operators could be the same as in the OneMax example.
http://deap.gel.ulaval.ca/doc/default/examples/ga_onemax_short.html
However, since you want to do multi-objective, you would need a multi-objective selection operator, either NSGA2 or SPEA2. Finally, the algorithm would have to be mu + lambda. For both multi-objective selection and mu + lambda algorithm usage, see the GA Knapsack example.
http://deap.gel.ulaval.ca/doc/default/examples/ga_knapsack.html
So essentially, to get up and running, you only have to merge a part of the onemax example with the knapsack while using the proposed evaluation function.

I suggest the excellent pyevolve library https://github.com/perone/Pyevolve. This will do most of the work for you, you will only have to define the fitness function and your representation nodes/functions. You can specify the crossover and mutation rate as well.

Fitting data to distributions?

I am not a statistician (more of a researchy web developer) but I've been hearing a lot about scipy and R these days. So out of curiosity I wanted to ask this question (though it might sound silly to the experts around here) because I am not sure of the advances in this area and want to know how people without a sound statistics background approach these problems.
Given a set of real numbers observed from an experiment, let us say they belong to one of the many distributions out there (like Weibull, Erlang, Cauchy, Exponential etc.), are there any automated ways of finding the right distribution and the distribution parameters for the data? Are there any good tutorials that walk me through the process?
Real-world Scenario:
For instance, let us say I initiated a small survey and recorded information about how many people a person talks to every day for say 300 people and I have the following information:
1 10
2 5
3 20
...
...
where X Y tells me that person X talked to Y people during the period of the survey. Now using the information from the 300 people, I want to fit this into a model. The question boils down to are there any automated ways of finding out the right distribution and distribution parameters for this data or if not, is there a good step-by-step procedure to achieve the same?

This is a complicated question, and there are no perfect answers. I'll try to give you an overview of the major concepts, and point you in the direction of some useful reading on the topic.
Assume that you a one dimensional set of data, and you have a finite set of probability distribution functions that you think the data may have been generated from. You can consider each distribution independently, and try to find parameters that are reasonable given your data.
There are two methods for setting parameters for a probability distribution function given data:
Least Squares
Maximum Likelihood
In my experience, Maximum Likelihood has been preferred in recent years, although this may not be the case in every field.
Here's a concrete example of how to estimate parameters in R. Consider a set of random points generated from a Gaussian distribution with mean of 0 and standard deviation of 1:
x = rnorm( n = 100, mean = 0, sd = 1 )
Assume that you know the data were generated using a Gaussian process, but you've forgotten (or never knew!) the parameters for the Gaussian. You'd like to use the data to give you reasonable estimates of the mean and standard deviation. In R, there is a standard library that makes this very straightforward:
library(MASS)
params = fitdistr( x, "normal" )
print( params )
This gave me the following output:
mean sd
-0.17922360 1.01636446
( 0.10163645) ( 0.07186782)
Those are fairly close to the right answer, and the numbers in parentheses are confidence intervals around the parameters. Remember that every time you generate a new set of points, you'll get a new answer for the estimates.
Mathematically, this is using maximum likelihood to estimate both the mean and standard deviation of the Gaussian. Likelihood means (in this case) "probability of data given values of the parameters." Maximum likelihood means "the values of the parameters that maximize the probability of generating my input data." Maximum likelihood estimation is the algorithm for finding the values of the parameters which maximize the probability of generating the input data, and for some distributions it can involve numerical optimization algorithms. In R, most of the work is done by fitdistr, which in certain cases will call optim.
You can extract the log-likelihood from your parameters like this:
print( params$loglik )
[1] -139.5772
It's more common to work with the log-likelihood rather than likelihood to avoid rounding errors. Estimating the joint probability of your data involves multiplying probabilities, which are all less than 1. Even for a small set of data, the joint probability approaches 0 very quickly, and adding the log-probabilities of your data is equivalent to multiplying the probabilities. The likelihood is maximized as the log-likelihood approaches 0, and thus more negative numbers are worse fits to your data.
With computational tools like this, it's easy to estimate parameters for any distribution. Consider this example:
x = x[ x >= 0 ]
distributions = c("normal","exponential")
for ( dist in distributions ) {
print( paste( "fitting parameters for ", dist ) )
params = fitdistr( x, dist )
print( params )
print( summary( params ) )
print( params$loglik )
}
The exponential distribution doesn't generate negative numbers, so I removed them in the first line. The output (which is stochastic) looked like this:
[1] "fitting parameters for normal"
mean sd
0.72021836 0.54079027
(0.07647929) (0.05407903)
Length Class Mode
estimate 2 -none- numeric
sd 2 -none- numeric
n 1 -none- numeric
loglik 1 -none- numeric
[1] -40.21074
[1] "fitting parameters for exponential"
rate
1.388468
(0.196359)
Length Class Mode
estimate 1 -none- numeric
sd 1 -none- numeric
n 1 -none- numeric
loglik 1 -none- numeric
[1] -33.58996
The exponential distribution is actually slightly more likely to have generated this data than the normal distribution, likely because the exponential distribution doesn't have to assign any probability density to negative numbers.
All of these estimation problems get worse when you try to fit your data to more distributions. Distributions with more parameters are more flexible, so they'll fit your data better than distributions with less parameters. Also, some distributions are special cases of other distributions (for example, the Exponential is a special case of the Gamma). Because of this, it's very common to use prior knowledge to constrain your choice models to a subset of all possible models.
One trick to get around some problems in parameter estimation is to generate a lot of data, and leave some of the data out for cross-validation. To cross-validate your fit of parameters to data, leave some of the data out of your estimation procedure, and then measure each model's likelihood on the left-out data.

Take a look at fitdistrplus (http://cran.r-project.org/web/packages/fitdistrplus/index.html).
A couple of quick things to note:
Try the function descdist, which provides a plot of skew vs. kurtosis of the data and also shows some common distributions.
fitdist allows you to fit any distributions you can define in terms of density and cdf.
You can then use gofstat which computes the KS and AD stats which measure distance of the fit from the data.

This is probably a bit more general than you need, but might give you something to go on.
One way to estimate a probability density function from random data is to use an Edgeworth or Butterworth expansion. These approximations use density function properties known as cumulants (the unbiased estimators for which are the k-statistics) and express the density function as a perturbation from a Gaussian distribution.
These both have some rather dire weaknesses such as producing divergent density functions, or even density functions that are negative over some regions. However, some people find them useful for highly clustered data, or as starting points for further estimation, or for piecewise estimated density functions, or as part of a heuristic.
M. G. Kendall and A. Stuart, The advanced theory of statistics, vol. 1,
Charles Griffin, 1963, was the most complete reference I found for this, with a whopping whole page dedicated to the topic; most other texts had a sentence on it at most or listed the expansion in terms of the moments instead of the cumulants which is a bit useless. Good luck finding a copy, though, I had to send my university librarian on a trip to the archives for it... but this was years ago, so maybe the internet will be more helpful today.
The most general form of your question is the topic of a field known as non-parametric density estimation, where given:
data from a random process with an unknown distribution, and
constraints on the underlying process
...you produce a density function that is the most likely to have produced the data. (More realistically, you create a method for computing an approximation to this function at any given point, which you can use for further work, eg. comparing the density functions from two sets of random data to see whether they could have come from the same process).
Personally, though, I have had little luck in using non-parametric density estimation for anything useful, but if you have a steady supply of sanity you should look into it.

I'm not a scientist, but if you were doing it with a pencil an paper, the obvious way would be to make a graph, then compare the graph to one of a known standard-distribution.
Going further with that thought, "comparing" is looking if the curves of a standard-distribution and yours are similar.
Trigonometry, tangents... would be my last thought.
I'm not an expert, just another humble Web Developer =)

You are essentially wanting to compare your real world data to a set of theoretical distributions. There is the function qqnorm() in base R, which will do this for the normal distribution, but I prefer the probplot function in e1071 which allows you to test other distributions. Here is a code snippet that will plot your real data against each one of the theoretical distributions that we paste into the list. We use plyr to go through the list, but there are several other ways to go through the list as well.
library("plyr")
library("e1071")
realData <- rnorm(1000) #Real data is normally distributed
distToTest <- list(qnorm = "qnorm", lognormal = "qlnorm", qexp = "qexp")
#function to test real data against list of distributions above. Output is a jpeg for each distribution.
testDist <- function(x, data){
jpeg(paste(x, ".jpeg", sep = ""))
probplot(data, qdist = x)
dev.off()
}
l_ply(distToTest, function(x) testDist(x, realData))

For what it's worth, it seems like you might want to look at the Poisson distribution.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.