I am using statsmodels.formula.api to perform linear regression. I have used three independent variables for prediction. In some cases I am getting negative values, but all of the output should be positive.
Is there any way to tell the model that the output cannot be negative?
import statsmodels.formula.api as smf
output1 = smf.ols(formula='y ~ A + B + C', data=data).fit()
output = output1.predict(my_data)
One standard way to model a positive or non-negative dependent (or response, or output) variable is to assume an exponential mean function.
The expected value of the response given the covariates is E(y | x) = exp(x b).
One way to model this is to use Poisson regression, either statsmodels Poisson or GLM with family Poisson. Given that Poisson is not the correct likelihood for a continuous response variable, we need to adjust the covariance of the parameter estimates for the misspecification with cov_type='HC0'. That is, we are using quasi-maximum likelihood estimation (QMLE).
output1 = smf.poisson(formula='y ~ A + B + C', data=data).fit(cov_type='HC0')
An alternative would be to log the response variable, which implicitly assumes a lognormal model.
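As a minimal sketch of this log-response alternative (reusing the column names and data from the question, and assuming y is strictly positive):

import numpy as np
import statsmodels.formula.api as smf

# OLS on the logged response implicitly assumes a lognormal model for y
log_fit = smf.ols(formula='np.log(y) ~ A + B + C', data=data).fit()

# Predictions are on the log scale; exponentiating makes them strictly positive
# (note: exp of the predicted mean of log(y) is the median of y, not its mean)
pred_positive = np.exp(log_fit.predict(my_data))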
http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/
https://stats.stackexchange.com/questions/8505/poisson-regression-vs-log-count-least-squares-regression
Note, statsmodels does not impose that the response variable in Poisson, Binomial, Logit and similar models is integer-valued, so we can use those models for quasi-maximum likelihood estimation with continuous data.
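For example, a minimal sketch of the GLM-with-family-Poisson route mentioned above (same assumed column names as in the question):

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Poisson GLM used as quasi-maximum likelihood for a continuous, non-negative y;
# the log link guarantees exp(x b) > 0 for every prediction
glm_fit = smf.glm(formula='y ~ A + B + C', data=data,
                  family=sm.families.Poisson()).fit(cov_type='HC0')
positive_predictions = glm_fit.predict(my_data)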
If you are trying to ensure that output values of your model are constrained within some bounds, linear regression is probably not an appropriate choice. It sounds like you might want logistic regression or some kind of model where the output falls within known bounds. Determining what kind of model you want might be a question for CrossValidated.
That being said, you can easily constrain your predictions after the fact - just set all the negative predictions to 0. Whether this makes any sense is a different question.
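For example, a minimal sketch of that ad hoc fix, reusing the fitted OLS model from the question:

import numpy as np

# Clip negative OLS predictions to zero; this is ad hoc, not a model-based fix
predictions = np.clip(output1.predict(my_data), 0, None)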
Related
Can anyone let me know the method for estimating the parameters of a fractional logit model in the statsmodels package of Python?
And can anyone refer me to the specific part of the source code for the fractional logit model?
I assume fractional Logit in the question refers to using the Logit model to obtain the quasi-maximum likelihood for continuous data within the interval (0, 1) or [0, 1].
The models in statsmodels like GLM and GEE, as well as Logit, Probit, Poisson and similar in statsmodels.discrete, do not impose an integer condition on the response or endogenous variable. So those models can be used for fractional or positive continuous data.
The parameter estimates are consistent if the mean function is correctly specified. However, the covariance of the parameter estimates is not correct under quasi-maximum likelihood. The sandwich covariance is available with the fit argument cov_type='HC0'. Robust sandwich covariance matrices are also available for cluster-robust, panel-robust or autocorrelation-robust cases.
e.g.
import statsmodels.api as sm

result = sm.Logit(y, x).fit(cov_type='HC0')
Given that the likelihood is not assumed to be correctly specified, the reported statistics based on the resulting maximized log-likelihood, i.e. llf, llnull and likelihood ratio tests, are not valid.
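As a minimal sketch with simulated fractional data (the names below are hypothetical, not from the question):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = sm.add_constant(rng.normal(size=(n, 2)))

# Fractional response in (0, 1), e.g. proportions or shares
y = 1.0 / (1.0 + np.exp(-(x @ np.array([0.5, 1.0, -1.0]))))
y = np.clip(y + rng.normal(scale=0.05, size=n), 0.01, 0.99)

# Logit as quasi-maximum likelihood; the sandwich covariance (HC0)
# gives usable standard errors even though y is not binary
result = sm.Logit(y, x).fit(cov_type='HC0')
print(result.summary())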
The only exceptions are multinomial (logit) models, which might impose the integer constraint on the response variable and might or might not work with compositional data. (Support for compositional data with QMLE is still an open question, because there are computational advantages to supporting only the standard cases.)
I am looking to build a predictive model and am working with our current JMP model. Our current approach is to guess an nth degree polynomial and then look at which terms are not significant model effects. Polynomials are not always the best and this leads to a lot of confusion and bad models. Our data can have between 2 and 7 effects and always has one response.
I want to use python for this, but package documentation or online guides for something like this are hard to find. I know how to fit a specific nth degree polynomial or do a linear regression in python, but not how to 'guess' the best function type for the data set.
Am I missing something obvious or should I be writing something that probes through a variety of function types? Precision is the most important. I am working with a small (~2000x100) data set.
Potentially I can do regression on smaller training sets, test them against the validation set, then rank the models and choose the best. Is there something better?
Try using other regression models instead of the vanilla Linear Model.
You can use something like this for polynomial regression:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(input_data)
And you can constrain the weights through Lasso regression:
from sklearn import linear_model

clf = linear_model.Lasso(alpha=0.5, positive=True)
clf.fit(X_, Y_)
where Y_ is the output you want to train against.
Setting alpha to 0 turns it into a simple linear regression. alpha is the strength of the L1 penalty on the weights: a larger alpha shrinks more coefficients toward zero. You can also force the weights to be non-negative with positive=True. See the scikit-learn documentation for Lasso.
Run it with a small degree and perform cross-validation to check how well it fits.
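For example, a minimal sketch of cross-validating over polynomial degrees (input_data and Y_ as above; the alpha value is just a placeholder):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Score a positive-Lasso polynomial model for each candidate degree
for degree in (1, 2, 3, 4):
    model = make_pipeline(PolynomialFeatures(degree=degree),
                          Lasso(alpha=0.5, positive=True, max_iter=10000))
    scores = cross_val_score(model, input_data, Y_, cv=5, scoring='r2')
    print(degree, scores.mean())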
Increasing the degree of the polynomial generally leads to over-fitting. So if you are forced to use degree 4 or 5, that means you should look for other models.
You should also take a look at this question, which explains how you can do curve fitting.
ANOVA (analysis of variance) partitions the variance to determine which effects are statistically significant... you shouldn't have to choose terms at random.
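A minimal sketch with statsmodels (the formula and column names are hypothetical):

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit a candidate polynomial model and test the significance of each term
fit = smf.ols('response ~ effect1 + effect2 + I(effect1**2)', data=df).fit()
print(sm.stats.anova_lm(fit, typ=2))  # F statistic and p-value per term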
However, if you are saying that your data is inhomogeneous (i.e., you shouldn't fit a single model to all of the data), then you might consider using the scikit-learn toolkit to build a classifier that could choose a subset of the data to fit.
I would like to predict multiple dependent variables using multiple predictors. If I understood correctly, in principle one could make a bunch of linear regression models that each predict one dependent variable, but if the dependent variables are correlated, it makes more sense to use multivariate regression. I would like to do the latter, but I'm not sure how.
So far I haven't found a Python package that specifically supports this. I've tried scikit-learn, and even though their linear regression model example only shows the case where y is an array (one dependent variable per observation), it seems to be able to handle multiple y. But when I compare the output of this "multivariate" method to the results I get by manually looping over each dependent variable and predicting them independently from each other, the outcome is exactly the same. I don't think this should be the case, because there is a strong correlation between some of the dependent variables (>0.5).
The code just looks like this, with y either an n x 1 matrix or an n x m matrix, and x and newx matrices of various sizes (the number of rows in x == n).
from sklearn import linear_model

ols = linear_model.LinearRegression()
ols.fit(x, y)
ols.predict(newx)
Does this function actually perform multivariate regression?
This is a mathematical/stats question, but I will try to answer it here anyway.
The outcome you see is absolutely expected. A linear model like this won't take correlation between dependent variables into account.
If you had only one dependent variable, your model would essentially consist of a weight vector
w_0 w_1 ... w_n,
where n is the number of features. With m dependent variables, you instead have a weight matrix
w_10  w_11  ...  w_1n
w_20  w_21  ...  w_2n
 ...   ...       ...
w_m0  w_m1  ...  w_mn
But the weights for the different output variables (1, ..., m) are completely independent of each other, and since the total sum of squared errors splits up into a sum of squared errors over each output variable, minimizing the total squared loss is exactly the same as setting up one univariate linear model per output variable and minimizing their squared losses independently of each other.
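A minimal sketch demonstrating this equivalence with random data (all names hypothetical):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))   # n = 100 observations, 5 features
y = rng.normal(size=(100, 3))   # m = 3 (possibly correlated) outputs

# One multi-output fit ...
joint = LinearRegression().fit(x, y)

# ... versus one independent fit per output column
separate = np.vstack([LinearRegression().fit(x, y[:, j]).coef_
                      for j in range(y.shape[1])])

print(np.allclose(joint.coef_, separate))  # True: identical coefficients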
If you want to take the correlation between the dependent variables into account, you probably need partial least squares (PLS) regression. This method searches for a projection of the independent variables and a projection of the dependent variables such that the covariance between the two projections is maximized. See the scikit-learn implementation in sklearn.cross_decomposition.PLSRegression.
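A minimal sketch of PLS regression in scikit-learn, reusing x, y and newx from the question (n_components is just an illustrative choice):

from sklearn.cross_decomposition import PLSRegression

# Project x and y onto 2 latent components that maximize their covariance
pls = PLSRegression(n_components=2)
pls.fit(x, y)               # y may contain multiple correlated columns
predictions = pls.predict(newx)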
I'm using python's statsmodels package to do linear regressions. Among the output of R^2, p, etc., there is also "log-likelihood". In the docs this is described as "The value of the likelihood function of the fitted model." I've taken a look at the source code and don't really understand what it's doing.
Reading more about likelihood functions, I still have very fuzzy ideas of what this 'log-likelihood' value might mean or be used for. So a few questions:
Isn't the value of the likelihood function, in the case of linear regression, the same as the value of the parameter (beta in this case)? It seems that way according to the following derivation leading to equation 12: http://www.le.ac.uk/users/dsgp1/COURSES/MATHSTAT/13mlreg.pdf
What's the use of knowing the value of the likelihood function? Is it to compare with other regression models with the same response and a different predictor? How do practical statisticians and scientists use the log-likelihood value spit out by statsmodels?
Likelihood (and by extension log-likelihood) is one of the most important concepts in statistics. It's used for everything.
For your first point, the likelihood is not the same as the value of the parameter. The likelihood is the likelihood of the entire model given a set of parameter estimates. It's calculated by taking a set of parameter estimates, calculating the probability density of each observation under those estimates, and then multiplying the densities for all the observations together (this follows from probability theory, in that P(A and B) = P(A)P(B) if A and B are independent). In practice, what this means for linear regression, and what that derivation shows, is that you take a set of parameter estimates (beta, sd), plug them into the normal pdf, and calculate the density for each observation y at that set of parameter estimates. Then multiply them all together. Typically we choose to work with the log-likelihood because it's easier to compute: instead of multiplying we can sum (log(a*b) = log(a) + log(b)), which is faster and numerically more stable. Also, we tend to minimize the negative log-likelihood (instead of maximizing the positive), because optimizers sometimes work better on minimization than maximization.
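A minimal sketch of this calculation, checked against the llf that statsmodels reports (random data with hypothetical names; statsmodels' OLS log-likelihood uses the MLE variance SSR/n):

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(0)
x = sm.add_constant(rng.normal(size=(200, 3)))
y = x @ np.array([1.0, 0.5, -0.3, 2.0]) + rng.normal(size=200)

res = sm.OLS(y, x).fit()

# Plug the estimates into the normal pdf for each observation and sum the logs
sigma = np.sqrt(res.ssr / res.nobs)      # MLE of the error standard deviation
loglike = norm.logpdf(y, loc=res.fittedvalues, scale=sigma).sum()

print(loglike, res.llf)                  # the two values agree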
To answer your second point, log-likelihood is used for almost everything. It's the basic quantity that we use to find parameter estimates (maximum likelihood estimates) for a huge suite of models. For simple linear regression, these estimates turn out to be the same as those for least squares, but for more complicated models least squares may not work. It's also used to calculate AIC, which can be used to compare models with the same response and different predictors (AIC penalizes the number of parameters, because more parameters always improve the in-sample fit).
Given time-series data, I want to find the best fitting logarithmic curve. What are good libraries for doing this in either Python or SQL?
Edit: Specifically, what I'm looking for is a library that can fit data resembling a sigmoid function, with upper and lower horizontal asymptotes.
If your data were categorical, then you could use a logistic regression to fit the probabilities of belonging to a class (classification).
However, I understand you are trying to fit the data to a sigmoid curve, which means you just want to minimize the mean squared error of the fit.
I would point you to the SciPy function scipy.optimize.leastsq: it is used to perform least-squares fits.
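A minimal sketch fitting a four-parameter logistic (lower/upper asymptotes, midpoint, slope) with scipy.optimize.leastsq, using synthetic data and hypothetical names:

import numpy as np
from scipy.optimize import leastsq

def sigmoid(params, t):
    lower, upper, midpoint, slope = params
    return lower + (upper - lower) / (1.0 + np.exp(-slope * (t - midpoint)))

def residuals(params, t, y):
    return y - sigmoid(params, t)

# Synthetic time series that roughly follows a sigmoid
t = np.linspace(0, 10, 100)
y = sigmoid([1.0, 5.0, 5.0, 1.5], t) + np.random.normal(scale=0.1, size=t.size)

# leastsq minimizes the sum of squared residuals starting from an initial guess
initial_guess = [y.min(), y.max(), np.median(t), 1.0]
best_params, _ = leastsq(residuals, initial_guess, args=(t, y))
print(best_params)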