regression model for skewed distribution in python

[Figure: histogram of 'Sud grenoblois / Vif PM10' showing a right-skewed, exponentially decaying distribution]
I would like to build a linear regression model.
The predictor variable 'Sud grenoblois / Vif PM10' has a decaying exponential distribution, as you can see on the graph. As far as I know, regression assumes a normal distribution of the predictor. Should I use a transformation of the variable or another type of regression?

You can apply a logarithm transformation to reduce right skewness.
For right-skewed data (tail to the right), the common transformations are the square root, cube root and logarithm; for left-skewed data the usual choices are powers such as the square or cube.
I don't think you need to change your regression model.
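As an illustration of the log-transform suggestion, a minimal sketch with made-up, right-skewed data (the variable values and coefficients below are synthetic, not from the question):

import numpy as np
import statsmodels.api as sm

# Synthetic right-skewed predictor standing in for the PM10 series.
rng = np.random.default_rng(0)
pm10 = rng.exponential(scale=20.0, size=500)
y = 3.0 + 0.8 * np.log1p(pm10) + rng.normal(scale=0.5, size=500)

# log1p handles zeros; use np.log if the predictor is strictly positive.
X = sm.add_constant(np.log1p(pm10))
result = sm.OLS(y, X).fit()
print(result.summary())

Comparing the residuals of this fit with those of a fit on the untransformed predictor is a quick way to check whether the transformation actually helps.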

Related

Python package for parameter estimation in PGMs with continuous variables

I am looking for a Python package to work on a probabilistic graphical model (PGM) with categorical and continuous variables. Some nodes in the PGM are latent variables. The conditional probabilities are defined by continuous functions such as the Beta distribution and the logistic function. I would like to estimate the parameters of these probability distributions from data using expectation maximization (EM).
I checked pgmpy, and it seems to let you model conditional probabilities with Gaussian functions and some other continuous forms. But as far as I understand, pgmpy only allows sampling data points from such PGMs. I am not sure whether pgmpy can estimate the parameters (say, the mean and std of the Gaussian distribution) in such cases.
Is there a Python package that can:
define the PGM structure by defining the nodes and edges
associate a conditional probability function with each edge, in the form of a Beta distribution or a similar function
use a built-in inference algorithm (such as EM) to learn the parameters of those conditional probabilities (say, the alpha and beta of the Beta distributions); a rough structure sketch follows this list
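For reference, the structure-definition part that pgmpy does cover might look roughly like this (a hedged sketch using pgmpy's discrete API; the Beta/logistic CPDs, latent variables and EM estimation asked about above are exactly the parts this does not address):

import pandas as pd
from pgmpy.models import BayesianNetwork  # called BayesianModel in older pgmpy releases
from pgmpy.estimators import MaximumLikelihoodEstimator

# Nodes and directed edges of the PGM.
model = BayesianNetwork([("A", "C"), ("B", "C")])

# With fully observed *discrete* data, pgmpy can estimate the CPDs directly.
data = pd.DataFrame({"A": [0, 1, 1, 0], "B": [1, 1, 0, 0], "C": [1, 1, 0, 1]})
model.fit(data, estimator=MaximumLikelihoodEstimator)
print(model.get_cpds("C"))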

Python model targeting n variable prediction equation

I am looking to build a predictive model and am working with our current JMP model. Our current approach is to guess an nth degree polynomial and then look at which terms are not significant model effects. Polynomials are not always the best and this leads to a lot of confusion and bad models. Our data can have between 2 and 7 effects and always has one response.
I want to use python for this, but package documentation or online guides for something like this are hard to find. I know how to fit a specific nth degree polynomial or do a linear regression in python, but not how to 'guess' the best function type for the data set.
Am I missing something obvious or should I be writing something that probes through a variety of function types? Precision is the most important. I am working with a small (~2000x100) data set.
Potentially I can do regression on smaller training sets, test them against the validation set, then rank the models and choose the best. Is there something better?
Try using other regression models instead of the vanilla Linear Model.
You can use something like this for polynomial regression (imports added for completeness):
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model

# Expand the raw inputs into polynomial terms up to the chosen degree.
poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(input_data)
And you can constrain the weights through Lasso regression:
clf = linear_model.Lasso(alpha=0.5, positive=True)
clf.fit(X_, Y_)
where Y_ is the output you want to train against.
Setting alpha to 0 turns it into plain linear regression. alpha is the strength of the L1 penalty, which shrinks the weights towards zero (and can drive some of them exactly to zero). You can also force the weights to be non-negative with positive=True. See the scikit-learn documentation for Lasso for details.
Run it with a small degree and perform a cross-validation to check how good it fits.
Increasing the degree of the polynomial generally leads to over-fitting. So if you are forced to use degree 4 or 5, that means you should look for other models.
You should also take a look at this related question, which explains how you can do curve fitting.
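Putting the pieces above together, a cross-validated degree search could look like this (a rough sketch on synthetic data; the candidate degrees, alpha and scoring choices are only placeholders to be tuned):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the 2-7 effects and one response.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = 1.5 * X[:, 0] + 2.0 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=2000)

# Score a few candidate degrees with cross-validation and keep the best one.
for degree in (1, 2, 3, 4):
    model = make_pipeline(PolynomialFeatures(degree=degree),
                          Lasso(alpha=0.5, positive=True, max_iter=10000))
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(degree, score)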
ANOVA (analysis of variance) uses F-tests on the variance explained by each term to determine which effects are statistically significant, so you shouldn't have to choose terms at random.
However, if you are saying that your data is inhomogeneous (i.e., you shouldn't fit a single model to all the data), then you might consider using the scikit-learn toolkit to build a classifier that could choose a subset of the data to fit.
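If you go the ANOVA route, statsmodels can produce the significance table for a candidate model. A minimal sketch on made-up data (the formula and column names are placeholders, not from the question):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Made-up effects x1..x3 and a single response y.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
df["y"] = 2.0 * df["x1"] + 0.5 * df["x2"] ** 2 + rng.normal(scale=0.5, size=200)

# Fit a candidate polynomial model; the ANOVA table reports which terms
# contribute significantly instead of guessing.
res = smf.ols("y ~ x1 + I(x2 ** 2) + x3", data=df).fit()
print(anova_lm(res, typ=2))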

statsmodels: Method used to generate confidence intervals for quantile regression coefficients?

I am using statsmodels.formula.api.quantreg() for quantile regression in Python. I see that when fitting the quantile regression model, there is an option to specify the significance level for the confidence intervals of the regression coefficients, and the confidence interval result appears in the summary of the fit.
What statistical method is being used to generate confidence intervals about the regression coefficients? It does not appear to be documented and I've dug through the source code for quantile_regression.py and summary.py to find this with no luck. Can anyone shed some light on this?
Inference for parameters is the same across models and is mostly inherited from the base classes.
Quantile regression has a model specific covariance matrix of the parameters.
tvalues, pvalues, confidence intervals, t_test and wald_test are all based on the assumption of an asymptotic normal distribution of the estimated parameters with the given covariance, and are "generic".
Linear models like OLS and WLS, and optionally some other models, can use the t and F distributions instead of the normal and chi-square distributions for Wald-test-based inference.
Specifically, conf_int is defined in statsmodels.base.model.LikelihoodModelResults.
Partial correction:
QuantReg uses the t and F distributions for inference, since it is currently treated as a linear regression model, and not the normal and chi-square distributions used by the related M-estimators (RLM) in statsmodels.robust.
Most models now have a use_t option to choose the inference distributions, but it hasn't been added to QuantReg yet.
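To make the above concrete, here is a small sketch (with made-up data) that reproduces conf_int() from the reported standard errors using the t distribution, which is what the generic Wald machinery described above amounts to:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Made-up data just to have something to fit.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(size=200)})
df["y"] = 1.0 + 2.0 * df["x"] + rng.normal(size=200)

res = smf.quantreg("y ~ x", df).fit(q=0.5)

# params +/- t_crit * bse with df_resid degrees of freedom should match
# conf_int(), given the t-based inference described above.
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, res.df_resid)
manual_ci = np.column_stack([res.params - t_crit * res.bse,
                             res.params + t_crit * res.bse])
print(res.conf_int(alpha=alpha))
print(manual_ci)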

SVM, scikit-learn: Decision values with RBF kernel

I have read somewhere that it's not possible to interpret SVM decision values with non-linear kernels, so only the sign matters. However, I saw a couple of articles putting a threshold on decision values (with SVMlight, though) [1] [2]. So I'm not sure whether putting thresholds on decision values is logical as well, but I'm curious about the results anyway.
So, the LibSVM Python interface directly returns the decision values along with the predicted target when you call predict(); is there any way to do this with scikit-learn? I have trained a binary classification SVM model using svm.SVC(), but got stuck right there.
In the source code I found the svm.libsvm.decision_function() function commented as "(libsvm name for this is predict_values)". Then I looked at svm.SVC.decision_function() and checked its source code:
dec_func = libsvm.decision_function(
    X, self.support_, self.support_vectors_, self.n_support_,
    self.dual_coef_, self._intercept_, self._label,
    self.probA_, self.probB_,
    svm_type=LIBSVM_IMPL.index(self._impl),
    kernel=kernel, degree=self.degree, cache_size=self.cache_size,
    coef0=self.coef0, gamma=self._gamma)

# In binary case, we need to flip the sign of coef, intercept and
# decision function.
if self._impl in ['c_svc', 'nu_svc'] and len(self.classes_) == 2:
    return -dec_func
It seems to be doing the equivalent of libsvm's predict_values, but why does it flip the sign of the decision values if it is just the equivalent?
Also, is there any way to calculate a confidence value for an SVM decision using this value or any other prediction output (apart from probability estimates and Platt's method; my model does not perform well when probability estimates are calculated)? Or, as has been argued, does only the sign of the decision value matter with non-linear kernels?
[1] http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0039195#pone.0039195-Teng1
[2] http://link.springer.com/article/10.1007%2Fs00726-011-1100-2
It seems to be doing the equivalent of libsvm's predict_values, but why does it flip the sign of the decision values if it is just the equivalent?
These are just implementation hacks regarding the internal representation of class signs. Nothing to truly be worried about.
sklearn's decision_function is the value of the inner product between the SVM's hyperplane w and your data x (possibly in the kernel-induced space), so you can use it, shift it, or analyze it. Its interpretation, however, is very abstract: in the case of the RBF kernel it is simply the integral of the product of a normal distribution centered at x (with variance equal to 1/(2*gamma)) and the weighted sum of normal distributions centered at the support vectors (with the same variance), where the weights are the alpha coefficients.
Also, is there any way to calculate a confidence value for an SVM decision using this value or any other prediction output?
Platt's scaling is used not because there is some "lobby" forcing us to; it is simply the "correct" way of estimating the SVM's confidence as a probability. However, if you are not interested in confidence in the "probability sense", but rather in any value that you can compare qualitatively (which point is more confident), then the decision function can be used to do it. It is roughly the distance between the point's image in kernel space and the separating hyperplane (up to the normalizing constant, the norm of w). So it is true that
abs(decision_function(x1)) < abs(decision_function(x2)) => x1 is less confident than x2.
In short: the bigger the absolute decision_function value, the "deeper" the point lies on its own side of the hyperplane.
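A small sketch of the scikit-learn side (synthetic data, default parameters): decision_function gives you the raw values, the sign reproduces the binary prediction, and the absolute value can be used for the qualitative ranking described above.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary problem just to demonstrate the API.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X[:250], y[:250])

scores = clf.decision_function(X[250:])    # signed decision values
preds = np.where(scores > 0, clf.classes_[1], clf.classes_[0])  # should match clf.predict(X[250:])
ranking = np.argsort(-np.abs(scores))      # larger |score| = "deeper", more confident point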

Python or SQL Logistic Regression

Given time-series data, I want to find the best fitting logarithmic curve. What are good libraries for doing this in either Python or SQL?
Edit: Specifically, what I'm looking for is a library that can fit data resembling a sigmoid function, with upper and lower horizontal asymptotes.
If your data were categorical, then you could use a logistic regression to fit the probabilities of belonging to a class (classification).
However, I understand you are trying to fit the data to a sigmoid curve, which means you just want to minimize the mean squared error of the fit.
I would redirect you to the SciPy function called scipy.optimize.leastsq: it is used to perform least squares fits.
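For the sigmoid-with-asymptotes case specifically, a minimal sketch using scipy.optimize.curve_fit, a convenience wrapper around the same least-squares machinery as leastsq (the data and starting values below are made up):

import numpy as np
from scipy.optimize import curve_fit

# Four-parameter logistic: lower/upper asymptotes, midpoint x0 and slope k.
def sigmoid(x, lower, upper, x0, k):
    return lower + (upper - lower) / (1.0 + np.exp(-k * (x - x0)))

# Made-up time series.
t = np.linspace(0, 10, 100)
y = sigmoid(t, 1.0, 5.0, 4.0, 1.5) + np.random.normal(scale=0.1, size=t.size)

# Reasonable starting guesses help the optimizer converge.
p0 = [y.min(), y.max(), np.median(t), 1.0]
params, _ = curve_fit(sigmoid, t, y, p0=p0)
print(params)  # estimated lower, upper, x0, k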
