I am looking for a Python package to work on a probabilistic graphical model (PGM) with categorical and continuous variables. Some nodes in the PGM are latent variables. The conditional probabilities are defined by continuous functions such as the Beta distribution and the logistic function. I would like to estimate the parameters of these probability distributions from data using expectation maximization (EM).
I checked pgmpy, and it seems to let you model conditional probabilities with Gaussian and a few other continuous distributions. But as far as I understand, pgmpy only allows sampling data points from such PGMs; I am not sure whether it can estimate the parameters (say, the mean and standard deviation of the Gaussian distribution) in such cases.
Is there a Python package that can:
define the PGM structure by specifying the nodes and edges
associate a conditional probability function with each edge, in the form of a Beta distribution or similar functions
use a built-in inference algorithm (such as EM) to learn the parameters (say, the alpha and beta of the Beta distributions) of the conditional probabilities (a hand-rolled toy example of what I mean follows below)
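To make the requirements concrete, here is a self-contained toy sketch of the kind of estimation I would like a package to handle for me: EM for a model with one latent binary node Z and a Beta-distributed observed child X, written by hand with numpy/scipy. The package should generalize this to a full graph of nodes and edges.

```python
import numpy as np
from scipy import optimize, stats

# toy model: latent binary Z, observed X | Z=k ~ Beta(alpha_k, beta_k), fitted by EM
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=500)
x = np.where(z == 0, rng.beta(2.0, 8.0, 500), rng.beta(8.0, 2.0, 500))

pi = np.array([0.5, 0.5])                 # mixing weights P(Z = k)
ab = np.array([[1.0, 3.0], [3.0, 1.0]])   # initial (alpha, beta) per latent state

for _ in range(100):
    # E-step: responsibilities r[i, k] = P(Z = k | x_i) under the current parameters
    dens = np.column_stack([stats.beta.pdf(x, a, b) for a, b in ab])
    r = pi * dens
    r /= r.sum(axis=1, keepdims=True)

    # M-step: update the mixing weights and each state's (alpha, beta) numerically
    pi = r.mean(axis=0)
    for k in range(2):
        nll = lambda p, k=k: -(r[:, k] * stats.beta.logpdf(x, p[0], p[1])).sum()
        ab[k] = optimize.minimize(nll, ab[k], bounds=[(1e-3, None)] * 2).x

print(pi)  # estimated P(Z = k)
print(ab)  # estimated (alpha, beta) for each latent state (up to label switching)
```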
Is it possible to fit a GLM with statsmodels and then use that fit to simulate a distribution of predicted values?
In the upcoming 0.14 release, there is a get_distribution method that, based on the predicted distribution parameters, returns an instance of a scipy or scipy-compatible distribution class. The methods of those classes include rvs for simulating data. (The Tweedie family is excluded, because no distribution class for it exists yet.)
Some of this is already available in earlier versions.
For some distribution families, like Gaussian, Poisson, and binomial, the parameters are directly available. For other distributions, like negative binomial or gamma, there is a conversion from the GLM parameterization to the standard distribution parameterization.
For example, for the Gaussian family, the dependent variable is normally distributed with mean given by results.fittedvalues or results.predict(x) and variance given by results.scale. The relevant lines in the family code are essentially:

```python
import numpy as np
from scipy import stats
# simplified excerpt from inside the Gaussian family's get_distribution
scale_n = scale / var_weights
return stats.norm(loc=mu, scale=np.sqrt(scale_n))
```
https://github.com/statsmodels/statsmodels/blob/main/statsmodels/genmod/families/family.py#L675
Note:
scale in scipy.stats distributions corresponds to the standard deviation, while it corresponds to the variance in GLM and OLS.
var_weights are like the weights in WLS, for the case when the variance differs across observations (heteroscedasticity). By default they are just ones, as in OLS, and can be ignored. (var_weights are currently only implemented for GLM.)
For non-Gaussian families, var_weights is a multiplicative factor for the inherent heteroscedasticity, where the variance is a function of the mean.
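As a concrete illustration (a minimal sketch on synthetic data), the Gaussian predictive distribution can be built by hand from the fitted results and then sampled:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

# synthetic placeholder data: design matrix with constant, linear mean, unit noise
rng = np.random.default_rng(0)
X = sm.add_constant(rng.uniform(0, 10, 100))
y = X @ np.array([1.0, 0.5]) + rng.standard_normal(100)

res = sm.GLM(y, X, family=sm.families.Gaussian()).fit()

# normal predictive distribution: mean = predicted values, variance = res.scale
pred_dist = stats.norm(loc=res.predict(X), scale=np.sqrt(res.scale))
simulated = pred_dist.rvs()  # one simulated outcome per observation

# in statsmodels >= 0.14 this construction is wrapped in res.get_distribution()
```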
Context:
In Gaussian Process (GP) regression we can use two approaches:

(I) Fit the kernel parameters via maximum likelihood (maximize the data likelihood) and use the GP defined by these parameters for prediction.

(II) The Bayesian approach: put a parametric prior distribution on the kernel parameters. The parameters of this prior distribution are called the hyperparameters. Condition on the data to obtain a posterior distribution for the kernel parameters, and now either

(IIa) fit the kernel parameters by maximizing the posterior kernel-parameter likelihood (the MAP parameters) and use the GP defined by the MAP parameters for prediction, or

(IIb) (the full Bayesian approach) predict using the mixture model that integrates over all the GPs defined by the admissible kernel parameters, weighted by the posterior distribution of the kernel parameters.

(IIb) is the principal approach advocated in the reference [RW2006] cited in the package.

The point is that hyperparameters exist only in the Bayesian approach and are the parameters of the prior distribution on the kernel parameters.
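Written out in my notation (θ the kernel parameters, φ the hyperparameters of their prior, y the data at inputs X, and f∗ a prediction), the approaches above are:

$$
\begin{aligned}
\text{(I)}\quad & \hat\theta_{\mathrm{ML}} = \arg\max_{\theta}\; p(y \mid X, \theta) \\
\text{(II)}\quad & p(\theta \mid y, X, \varphi) \;\propto\; p(y \mid X, \theta)\, p(\theta \mid \varphi) \\
\text{(IIa)}\quad & \hat\theta_{\mathrm{MAP}} = \arg\max_{\theta}\; p(\theta \mid y, X, \varphi) \\
\text{(IIb)}\quad & p(f_\ast \mid y, X, \varphi) = \int p(f_\ast \mid y, X, \theta)\, p(\theta \mid y, X, \varphi)\, d\theta
\end{aligned}
$$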
Therefore I am confused about the use of the term "hyperparameters" in the documentation, e.g. here, where it is stated that "Kernels are parameterized by a vector of hyperparameters". This must be interpreted as a sort of indirect parameterization via conditioning on the data, as the hyperparameters do not directly determine the kernel parameters.
Then an example is given of the exponential kernel and its length-scale parameter.
This is definitely not a hyperparameter as this term is generally used.
No distinction seems to be drawn between kernel-parameters and hyperparameters.
This is confusing and it is now unclear if the package uses the Bayesian approach at all.
For example where do we specify the parametric family of prior distributions on kernel parameters?
Question: does scikit-learn use approach (I) or (II)?
Here is my own tentative answer:
The confusion comes from the fact that a Gaussian Process is often called a "prior on functions", indicating some sort of Bayesianism. Worse still, the process is infinite-dimensional, so restricting it to the finite data dimensions is some sort of "marginalization". This is also confusing, since in general you have marginalization only in the Bayesian approach, where you have a joint distribution of data and parameters and you often marginalize out one or the other.
The correct view here, however, is the following: the Gaussian Process is the model, and the kernel parameters are the model parameters. In scikit-learn there are no hyperparameters, since there is no prior distribution on the kernel parameters; the so-called LML (log marginal likelihood) is the ordinary data likelihood given the model parameters, and the parameter fit is ordinary maximum data likelihood. In short, the approach is (I), not (II).
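For reference, the quantity being maximized is the standard GP log marginal likelihood from [RW2006], where K_θ is the kernel matrix for kernel parameters θ and σ_n² the noise variance; "marginal" refers to integrating out the latent function values f, not the kernel parameters:

$$
\log p(y \mid X, \theta)
= -\tfrac{1}{2}\, y^{\top}\bigl(K_\theta + \sigma_n^2 I\bigr)^{-1} y
\;-\; \tfrac{1}{2}\log\bigl|K_\theta + \sigma_n^2 I\bigr|
\;-\; \tfrac{n}{2}\log 2\pi
$$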
If you read the scikit-learn documentation on GP regression, you clearly see that the kernel (hyper)parameters are optimized. Take a look, for example, at the description of the argument n_restarts_optimizer: "The number of restarts of the optimizer for finding the kernel’s parameters which maximize the log-marginal likelihood." In the terms of your question, that is approach (I).
I would note two more things though:
In my mind, the fact that they are called "hyperparameters" automatically implies that they are deterministic and can be estimated directly. Otherwise, they are random variables and that is why they can have a distribution. Another way to think of it is: did you define a prior for it? If not, then it is a parameter! If you did, then the prior's hyperparameter(s) may be what needs to be determined.
Note that the GaussianProcessRegressor class "exposes a method log_marginal_likelihood(theta), which can be used externally for other ways of selecting hyperparameters, e.g., via Markov chain Monte Carlo." So, technically it is possible to make it "fully Bayesian" (your approach (II)), but you must provide the inference method yourself.
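A minimal sketch of approach (I) as scikit-learn implements it, on synthetic data (the kernel choice and settings here are purely illustrative):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(50)

# the kernel parameters (length_scale, noise_level) are fitted by maximizing the LML
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5).fit(X, y)

print(gpr.kernel_)                                     # ML point estimates of the kernel parameters
print(gpr.log_marginal_likelihood(gpr.kernel_.theta))  # LML at those estimates (theta is on log scale)

# for a "fully Bayesian" treatment (approach II) you would instead sample theta with
# your own MCMC, using log_marginal_likelihood(theta) as the log-likelihood of theta
```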
[figure: distribution of the 'Sud grenoblois / Vif PM10' variable, which decays roughly exponentially]
I would like to make a linear regression model.
Predictor variable: 'Sud grenoblois / Vif PM10' has an exponentially decaying distribution, as you can see in the graph above. As far as I know, regression assumes a normal distribution of the predictor. Should I use a transformation of the variable or another type of regression?
You can apply a logarithm transformation to reduce right skewness. If the tail were to the left (negative skew), the common transformations would instead be squaring or cubing the variable, or reflecting it and then taking logarithms.
I don't think you need to change your regression model: linear regression makes its distributional assumptions about the residuals, not about the marginal distribution of the predictors.
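A quick sketch of the log transform with statsmodels, using synthetic data as a stand-in for the real columns (the names pm10 and target are placeholders):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic stand-in: an exponentially distributed (right-skewed) predictor
rng = np.random.default_rng(0)
pm10 = rng.exponential(scale=20.0, size=200)
target = 1.5 * np.log1p(pm10) + rng.normal(scale=0.5, size=200)
df = pd.DataFrame({"pm10": pm10, "target": target, "log_pm10": np.log1p(pm10)})

# compare the fit on the raw and on the log-transformed predictor
res_raw = smf.ols("target ~ pm10", data=df).fit()
res_log = smf.ols("target ~ log_pm10", data=df).fit()
print(res_raw.rsquared, res_log.rsquared)
```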
I am using statsmodels.formula.api.quantreg() for quantile regression in Python. I see that when fitting the quantile regression model there is an option to specify the significance level for confidence intervals of the regression coefficients, and the confidence interval result appears in the summary of the fit.
What statistical method is being used to generate confidence intervals about the regression coefficients? It does not appear to be documented and I've dug through the source code for quantile_regression.py and summary.py to find this with no luck. Can anyone shed some light on this?
Inference for parameters is the same across models and is mostly inherited from the base classes.
Quantile regression has a model specific covariance matrix of the parameters.
tvalues, pvalues, confidence intervals, t_test and wald_test are all based on the assumption of an asymptotic normal distribution of the estimated parameters with the given covariance, and are "generic".
Linear models like OLS and WLS, and optionally some other models, can use the t and F distributions instead of the normal and chi-square distributions for the Wald-test-based inference.
Specifically, conf_int is defined in statsmodels.base.model.LikelihoodModelResults.
Partial correction:
QuantReg uses the t and F distributions for inference, since it is currently treated as a linear regression model, rather than the normal and chi-square distributions used by the related M-estimators (RLM) in statsmodels.robust.
Most models now have a use_t option to choose the inference distributions, but it hasn't been added to QuantReg.
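A short sketch on synthetic data showing where the reported intervals come from; the manual reconstruction just illustrates the t-based Wald interval described above:

```python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 1.0 + 0.5 * df["x"] + rng.standard_normal(200)

res = smf.quantreg("y ~ x", df).fit(q=0.5)
print(res.conf_int(alpha=0.05))        # the intervals reported in res.summary()

# generic Wald interval: params +/- t_crit * bse, with the residual degrees of freedom
t_crit = stats.t.ppf(0.975, res.df_resid)
print(res.params - t_crit * res.bse, res.params + t_crit * res.bse)
```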
I'm using Python's statsmodels package to do linear regressions. Among the output of R^2, p-values, etc., there is also "log-likelihood". In the docs this is described as "The value of the likelihood function of the fitted model." I've taken a look at the source code and don't really understand what it's doing.
Reading more about likelihood functions, I still have very fuzzy ideas of what this 'log-likelihood' value might mean or be used for. So a few questions:
Isn't the value of the likelihood function, in the case of linear regression, the same as the value of the parameter (beta in this case)? It seems that way according to the following derivation leading to equation 12: http://www.le.ac.uk/users/dsgp1/COURSES/MATHSTAT/13mlreg.pdf
What's the use of knowing the value of the likelihood function? Is it to compare with other regression models with the same response and a different predictor? How do practical statisticians and scientists use the log-likelihood value spit out by statsmodels?
Likelihood (and by extension log-likelihood) is one of the most important concepts in statistics. It's used for everything.
For your first point: the likelihood is not the same as the value of the parameter. The likelihood is the likelihood of the entire model given a set of parameter estimates. It's calculated by taking a set of parameter estimates, computing the probability density of each observation under those estimates, and then multiplying the densities of all the observations together (this follows from probability theory, in that P(A and B) = P(A)P(B) if A and B are independent). In practice, what this means for linear regression, and what that derivation shows, is that you take a set of parameter estimates (beta, sd), plug them into the normal pdf, and calculate the density of each observation y at that set of parameter estimates; then you multiply them all together. Typically we choose to work with the log-likelihood because it's easier to compute: instead of multiplying we can sum (log(a*b) = log(a) + log(b)), which is simpler and avoids numerical underflow. Also, we tend to minimize the negative log-likelihood (instead of maximizing the positive), because optimizers sometimes work better on minimization than maximization.
To answer your second point: log-likelihood is used for almost everything. It's the basic quantity we use to find parameter estimates (maximum likelihood estimates) for a huge suite of models. For simple linear regression these estimates turn out to be the same as those from least squares, but for more complicated models least squares may not work. It's also used to calculate AIC, which can be used to compare models with the same response and different predictors (AIC penalizes the number of parameters, because more parameters give a better fit regardless).
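For example, a quick sketch (again on synthetic data; the second predictor is pure noise by construction) of comparing two models for the same response via llf and AIC:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, 100)
x2 = rng.standard_normal(100)                      # unrelated predictor
y = 1.0 + 0.5 * x1 + rng.standard_normal(100)

res1 = sm.OLS(y, sm.add_constant(x1)).fit()
res2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# the larger model always has at least as high a log-likelihood,
# but AIC = 2k - 2*llf penalizes the extra parameter
print(res1.llf, res2.llf)
print(res1.aic, res2.aic)
```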