Robust Linear Regression Results in Python and Stata Do Not Agree

My groupmates and I were doing an assignment that involves running a regression for the Fama-French three-factor model. I used the Python statsmodels module and they used Stata, and we worked from the same data set. For ordinary least squares regression we got the same answers, but for some reason the robust regression results don't agree.
Here is the result from Stata:
Here is the result from Statsmodels:
Just wondering what could be the cause of this issue? Is there any way to resolve it? I also tried different methods (HuberT, RamsayE, etc.) in statsmodels and none of them gave the same answers as the Stata result. Any help is appreciated.

The equivalent of Stata's
regress ..., robust
in statsmodels is
OLS(...).fit(cov_type='HC1')
The options for the robust sandwich covariance matrices are documented here: http://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.RegressionResults.get_robustcov_results.html, but they are selected through the fit keywords (cov_type and cov_kwds).
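As a minimal sketch with made-up Fama-French style data (the actual data set is not in the post), the correspondence looks like this:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data standing in for the real factor returns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((200, 3)), columns=["mkt_rf", "smb", "hml"])
df["excess_ret"] = 0.5 * df["mkt_rf"] + 0.2 * df["smb"] + rng.standard_normal(200)

X = sm.add_constant(df[["mkt_rf", "smb", "hml"]])
y = df["excess_ret"]

# Plain OLS: same point estimates and standard errors as Stata's regress.
ols_res = sm.OLS(y, X).fit()

# HC1 sandwich covariance: the counterpart of Stata's regress ..., robust.
robust_res = sm.OLS(y, X).fit(cov_type="HC1")
print(robust_res.bse)  # heteroskedasticity-robust standard errors

The point estimates are identical to plain OLS; only the standard errors, and the statistics derived from them, change.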
There is an incomplete FAQ answer for differences in robust standard errors between Stata and statsmodels. https://github.com/statsmodels/statsmodels/issues/1923
statsmodels.robust and RLM refer to outlier-robust estimation. This is an M-estimator, and its covariance has the original Huber sandwich form.
Here is the main page for statsmodels.robust
http://www.statsmodels.org/devel/rlm.html
and the documentation for RLM
http://www.statsmodels.org/devel/generated/statsmodels.robust.robust_linear_model.RLM.html
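For comparison, this is what the outlier-robust M-estimator looks like (a sketch reusing the hypothetical X and y from the OLS example above); it answers a different question than regress ..., robust:

import statsmodels.api as sm

# Outlier-robust regression (M-estimation), not robust standard errors.
rlm_res = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print(rlm_res.params)  # outlying observations are downweighted; SEs use the Huber sandwich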

Related

linear ill-conditioned problems using sklearn.linear_model.Ridge - best way to describe training data?

Problem statement: I'm working with a linear system of equations that corresponds to an inverse problem that is ill-posed. I can apply Tikhonov regularization or ridge regression by hand in Python, and get solutions on test data that are sufficiently accurate for my problem. I'd like to try solving this problem using sklearn.linear_model.Ridge, because I'd like to try other machine-learning methods in the linear models part of that package (https://scikit-learn.org/stable/modules/linear_model.html). I'd like to know whether sklearn is the wrong tool for this context.
What I've done: I read the documentation for sklearn.linear_model.Ridge. Since I know the linear transformation corresponding to the forward problem, I ran it over impulse responses to create training data and then supplied that to sklearn.linear_model.Ridge to generate a model. Unlike when I apply the ridge regression equations myself in Python, the model from sklearn.linear_model.Ridge only works on impulse responses; applying the equations by hand produces a model that can be applied to any linear combination of the impulse responses.
Is there a way to apply the linear methods of sklearn, without needing to generate a large test data set that represents the entire parameter space of the problem, or is this requisite for using (even linear) machine learning algorithms?
Should sklearn.model.Ridge return the same results as solving the equation for ridge regression, when the sklearn method is applied to test cases that span the forward problem?
Many thanks to anyone who can help my understanding.
Found the answer through trial and error. Answering my own question in case anyone was thinking like I did and needs clarity.
Yes, if you use training data that spans the problem space, it is the same as running ridge regression in python using the equations. sklearn does what it says in the documentation.
You need to use fit_intercept=True to get sklearn.linear_model.Ridge to fit the Y intercept of your problem, otherwise it is assumed to be zero.
If you use fit_intercept=False and your problem does NOT have a Y-intercept of zero, you will, of course, get a bad solution.
This might lead a novice like me to the impression that not enough training data was supplied, which is incorrect.
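A small sketch of the difference on synthetic data (not the original inverse problem):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 4.0  # true intercept is 4, not 0

# With fit_intercept=True the offset is estimated, so predictions generalize
# beyond the training inputs.
good = Ridge(alpha=1e-3, fit_intercept=True).fit(X, y)

# With fit_intercept=False the intercept is forced to zero, which distorts the
# coefficients whenever the true intercept is not zero.
bad = Ridge(alpha=1e-3, fit_intercept=False).fit(X, y)

print(good.intercept_, good.coef_)
print(bad.intercept_, bad.coef_)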

Results of sklearn/statsmodels ordinary least squares under singular covariance matrix

When computing ordinary least squares regression with either sklearn.linear_model.LinearRegression or statsmodels.regression.linear_model.OLS, neither seems to throw any errors when the covariance matrix is exactly singular. It looks like under the hood they use the Moore-Penrose pseudoinverse rather than the usual inverse, which would be impossible with a singular covariance matrix.
The question is then twofold:
What is the point of this design? Under what circumstances is it deemed useful to compute OLS regardless of whether the covariance matrix is singular?
What does it output as coefficients then? To my understanding, since the covariance matrix is singular, there would be an infinite number of solutions (in the sense of a scaling constant) via the pseudoinverse.
Two related questions and answers:
Differences in Linear Regression in R and Python
Statsmodels with partly identified model
To 1) Under what circumstances is it deemed useful to compute OLS regardless of whether the covariance matrix is singular?
Even though some parameters are not identified and an "arbitrary" unique solution is picked out of the infinite possible solutions, some result statistics are not affected by the non-identification; the main ones are estimable linear combinations, prediction, and R-squared.
Some linear combinations of parameters are identified even if not all parameters are identified separately. For example, we can still test whether all means in a one-way categorical variable are equal. These are estimable functions even under singularity, and they are the reason statsmodels inherited the pinv behavior from its precursor package. However, statsmodels does not have functions to identify estimable functions from a singular covariance matrix of the parameter estimates.
We get a unique prediction for any values of the explanatory variables which is still useful if the perfect collinearity persists.
Some summary and inferential statistics, like R-squared, are independent of the way the unique parameters are chosen. This is sometimes convenient and used, for example, in diagnostics and specification tests where the LM test can be computed from R-squared.
To 2) What does it output as coefficients then?
The parameters estimated by the Moore-Penrose inverse can be interpreted as symmetrically penalized or regularized estimates. The Moore-Penrose solution is also the limit of ridge regression as the penalization weight goes to zero. (I don't remember where I read this.)
Also, in some cases with singular design, the indeterminacy only affects some parameters. Even though we have to be careful in what we infer about those parameters, other parameters might still be identified and unaffected by the perfectly collinear part.
A software package has essentially 3 options to handle singular cases:
raise an exception and refuse to compute anything
drop some variables; the question is which variables to drop
switch to penalized solution including generalized inverse
statsmodels picks option 3, mainly because of the symmetric treatment of variables. R and Stata pick option 2 in many models (where I think it is difficult to predict which variable is dropped).
One reason for the symmetric treatment is that it makes it easier to compare the same regression across many data sets, which would be more difficult if the dropped variable is not always the same under option 2.
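A small sketch (synthetic data, assumed) illustrating option 3 in statsmodels: with a perfectly collinear design, OLS still returns coefficients via pinv, and predictions and R-squared are unaffected by the non-identification:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.standard_normal(100)
x2 = 2.0 * x1                       # perfectly collinear with x1
y = 3.0 * x1 + rng.standard_normal(100)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()            # no exception: pinv picks one of the solutions

print(res.params)                   # individual slopes are not identified ...
print(res.rsquared)                 # ... but R-squared is
print(res.predict(X)[:5])           # ... and so are the fitted values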
That's indeed the case. As you can see here
sklearn.linear_model.LinearRegression is based on scipy.linalg.lstsq or scipy.optimize.nnls, which in turn compute the pseudoinverse of the feature matrix via its SVD (they do not use the normal equations, for which you would have the mentioned issue, as that is less efficient). Moreover, observe that each sklearn.linear_model.LinearRegression instance returns the singular values of the feature matrix in its singular_ attribute and its rank in the rank_ attribute.
A similar argument applies to statsmodels.regression.linear_model.OLS, where the fit() method of the RegressionModel class states the following:
The fit method uses the pseudoinverse of the design/exogenous variables to solve the least squares minimization.
(see here for reference).
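For example (a sketch with the same rank-deficient design as above):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.standard_normal(100)
X = np.column_stack([x1, 2.0 * x1])   # rank-deficient feature matrix
y = 3.0 * x1 + rng.standard_normal(100)

lr = LinearRegression().fit(X, y)     # solved with scipy.linalg.lstsq (SVD based)
print(lr.coef_)                       # a minimum-norm solution
print(lr.rank_)                       # 1, not 2
print(lr.singular_)                   # one singular value is (numerically) zero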
I noticed the same thing; it seems sklearn and statsmodels are pretty robust, maybe a little too robust, which leaves you wondering how to interpret the results after all. I guess it is still up to the modeler to do due diligence to identify any collinearity between variables and eliminate unnecessary ones. It's funny that sklearn won't even give you p-values, which to me are the most important measure from these regressions. When you play with the variables, the coefficients change, which is why I pay much more attention to p-values.

Feature scaling for polynomial regression

Do we have to scale the polynomial features when creating a polynomial regression?
This question is already answered here and the answer is no. But when creating a model with scikit-learn, I do observe a huge difference.
I also found this article about the Importance of Feature Scaling in Data Modeling, and its polynomial-features example shows that the scaling does have an impact.
What did I miss ?
Maybe because of numerical issues: polynomials are prone to ill-conditioning. I have experienced such problems (outside of machine learning) with not-so-complex polynomial models, and the first solution was scaling the values of the features.
Symbolically the result is the same, but numerically it can be much different in some instances.
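A quick way to see the numerical side of this is to compare the condition number of the polynomial design matrix with and without scaling (a rough sketch; the exact numbers depend on the range of the raw feature):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

x = np.linspace(0, 100, 200).reshape(-1, 1)   # raw feature on a wide range

poly = PolynomialFeatures(degree=5, include_bias=False)
X_raw = poly.fit_transform(x)
X_scaled = poly.fit_transform(StandardScaler().fit_transform(x))

# The raw polynomial columns span many orders of magnitude, so the design
# matrix is far more ill-conditioned than the scaled one.
print(np.linalg.cond(X_raw))
print(np.linalg.cond(X_scaled))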

Complex Regression Model in Python

For a project I am working on, I need to find a model for the data graphed below that includes a sine or cosine component (hard to tell from the image but the data does follow a trig-like function for each period, although the amplitude/max/mins are changing).
(plot of the data omitted)
I originally planned on finding a simple regression model for my data using Desmos before I saw how complex the data was, but alas, I do not think I am capable of determining what equation to use without the help of Python. I don't have much experience with regression in Python, I've only done basic linear modeling where I knew the type of equation and was just determining the coefficients/constants. Could anyone offer a guiding example, git code, or resources that would be useful for this?
Your question is pretty generic and, looking at the graph, we cannot tell much about the data to give you a more detailed answer, but I'd say have a look at OLS:
https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html
You could also look at scikit-learn for the various regression models it provides:
http://scikit-learn.org/stable/modules/linear_model.html
Essentially, these packages will help you figure out the equation you are looking for to fit your data.
Also, it looks like your graph has an outlier? Please note that regression is very sensitive to outliers, so you may want to handle those data points before fitting the model.
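If the period of the oscillation is roughly known, one hedged starting point is to put sine and cosine terms (plus a trend) directly into an OLS design matrix. The frequency omega below is an assumption for illustration, not something read off the posted graph:

import numpy as np
import statsmodels.api as sm

# Hypothetical series; replace t and y with the actual data.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 300)
y = 2.0 * np.sin(2 * np.pi * t) + 0.3 * t + rng.normal(scale=0.5, size=t.size)

omega = 2 * np.pi                  # assumed angular frequency (one cycle per unit of t)
X = sm.add_constant(np.column_stack([np.sin(omega * t),
                                     np.cos(omega * t),
                                     t]))              # linear trend term
res = sm.OLS(y, X).fit()
print(res.params)                  # amplitude and phase follow from the sin/cos coefficients

If the frequency itself is unknown, it has to be estimated as well, for example with scipy.optimize.curve_fit on a parametric sine model or by inspecting a periodogram first.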

Multivariable regression attribute selection in python

I'm a beginner to using statsmodels & I'm also open to using other Python based methods of solving my problem:
I have a data set with ~ 85 features some of which are highly correlated.
When I run the OLS method I get a helpful 'strong multicollinearity problems' warning as I might expect.
I've previously run this data through Weka, which as part of the regression classifier has an eliminateColinearAttributes option.
How can I do the same thing - get the model to choose which attributes to use instead of having them all in the model?
Thanks!
For a simple linear regression you can use scipy.stats.linregress (note that it handles only a single explanatory variable); check out this nice example which has a good explanation.
The eliminateColinearAttributes option in the software you've mentioned is just some algorithm implemented in that software to fight the problem. Here, you need to implement an iterative algorithm yourself, based on eliminating, among the highly correlated variables, the one with the highest p-value, then running the regression again and repeating until the multicollinearity is gone.
There is no one and only way here; there are different techniques. It is also good practice to choose manually, from each set of variables that are highly correlated with each other, which one to omit so that the choice also makes sense for the problem.
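A hedged sketch of one such iterative scheme, using variance inflation factors (VIF) from statsmodels: keep dropping the most collinear column until every VIF falls below a chosen threshold (10 is a common rule of thumb, not a hard rule):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_collinear(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively drop the column with the highest VIF until all VIFs <= threshold."""
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() <= threshold:
            return X
        X = X.drop(columns=[vifs.idxmax()])

# Usage, assuming X is the DataFrame of ~85 features and y is the response:
# X_reduced = drop_collinear(X)
# res = sm.OLS(y, sm.add_constant(X_reduced)).fit()

This only addresses collinearity; whether a dropped column is the right one to drop is still a modeling decision, as noted above.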
