Short version: I was using scikit-learn's LinearRegression on some data, but I'm used to p-values, so I put the data into statsmodels' OLS. Although the R^2 is about the same, the variable coefficients all differ by large amounts. This concerns me, since the most likely explanation is that I've made an error somewhere, and now I don't feel confident in either output (I have probably built one of the models incorrectly, but I don't know which).
Longer version: Because I don't know where the issue is, I don't know exactly which details to include, and including everything is probably too much. I am also not sure about including code or data.
I am under the impression that scikit's LR and statsmodels OLS should both be doing OLS, and as far as I know OLS is OLS so the results should be the same.
For scikit's LR, the results are (statistically) the same whether or not I set normalize=True or =False, which I find somewhat strange.
For statsmodels OLS, I normalize the data using StandardScaler from sklearn. I add a column of ones so it includes an intercept (since scikit's output includes an intercept). More on that here: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html (Adding this column did not change the variable coefficients to any notable degree and the intercept was very close to zero.) StandardScaler didn't like that my ints weren't floats, so I tried this: https://github.com/scikit-learn/scikit-learn/issues/1709
That makes the warning go away but the results are exactly the same.
Granted, I'm using 5-fold CV for the sklearn approach (the R^2 values are consistent for both test and training data each time), while for statsmodels I just give it all the data. (A rough sketch of both pipelines is below.)
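Roughly, the two pipelines look like this; this is a sketch rather than my exact code, and the file and column names are placeholders:
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
# hypothetical weekly data; 'levels_gained' stands in for my dependent variable
df = pd.read_csv('wow_weekly.csv')
y = df['levels_gained'].astype(float)
X = df.drop(columns='levels_gained').astype(float)
# sklearn: 5-fold CV on the raw features (LinearRegression fits its own intercept)
lr = LinearRegression()
print(cross_val_score(lr, X, y, cv=5, scoring='r2'))
# statsmodels: standardize, add a column of ones, fit on all the data
X_std = sm.add_constant(StandardScaler().fit_transform(X))
print(sm.OLS(y, X_std).fit().params)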
R^2 is about 0.41 for both sklearn and statsmodels (this is good for social science). This could be a good sign or just a coincidence.
The data is observations of avatars in WoW (from http://mmnet.iis.sinica.edu.tw/dl/wowah/) which I munged about to make it weekly with some different features. Originally this was a class project for a data science class.
Independent variables include number of observations in a week (int), character level (int), if in a guild (Boolean), when seen (Booleans on weekday day, weekday eve, weekday late, and the same three for weekend), a dummy for character class (at the time for the data collection, there were only 8 classes in WoW, so there are 7 dummy vars and the original string categorical variable is dropped), and others.
The dependent variable is how many levels each character gained during that week (int).
Interestingly, some of the relative ordering within groups of related variables is maintained across statsmodels and sklearn. The rank order of the "when seen" variables is the same, although the loadings are very different, and the rank order of the character-class dummies is the same, although again the loadings are very different.
I think this question is similar to this one: Difference in Python statsmodels OLS and R's lm
I am good enough at Python and stats to make a go of it, but then not good enough to figure something like this out. I tried reading the sklearn docs and the statsmodels docs, but if the answer was there staring me in the face I did not understand it.
I would love to know:
Which output might be accurate? (Granted they might both be if I missed a kwarg.)
If I made a mistake, what is it and how to fix it?
Could I have figured this out without asking here, and if so how?
I know this question has some rather vague bits (no code, no data, no output), but I am thinking it is more about the general processes of the two packages. Sure, one seems to be more stats and one seems to be more machine learning, but they're both OLS so I don't understand why the outputs aren't the same.
(I even tried some other OLS calls to triangulate, one gave a much lower R^2, one looped for five minutes and I killed it, and one crashed.)
Thanks!
It sounds like you are not feeding the same matrix of regressors X to both procedures (but see below). Here's an example to show you which options you need to use for sklearn and statsmodels to produce identical results.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
# Generate artificial data (2 regressors + constant)
nobs = 100
X = np.random.random((nobs, 2))
X = sm.add_constant(X)
beta = [1, .1, .5]
e = np.random.random(nobs)
y = np.dot(X, beta) + e
# Fit regression model
sm.OLS(y, X).fit().params
>> array([ 1.4507724 , 0.08612654, 0.60129898])
LinearRegression(fit_intercept=False).fit(X, y).coef_
>> array([ 1.4507724 , 0.08612654, 0.60129898])
As a commenter suggested, even if you are giving both programs the same X, X may not have full column rank, and sm/sk could be taking (different) actions under the hood to make the OLS computation go through (i.e. dropping different columns).
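A quick way to check this, assuming X is the design matrix you hand to both libraries (a sketch, not your data):
import numpy as np
# if the rank is smaller than the number of columns, X is rank deficient
# and the two libraries may resolve the redundancy differently
print(np.linalg.matrix_rank(X), X.shape[1])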
I recommend you use pandas and patsy to take care of this:
import pandas as pd
from patsy import dmatrices
dat = pd.read_csv('wow.csv')
y, X = dmatrices('levels ~ week + character + guild', data=dat)
Or, alternatively, the statsmodels formula interface:
import statsmodels.formula.api as smf
dat = pd.read_csv('wow.csv')
mod = smf.ols('levels ~ week + character + guild', data=dat).fit()
Edit: This example might be useful: http://statsmodels.sourceforge.net/devel/example_formulas.html
If you use statsmodels, I would highly recommend using the statsmodels formula interface instead. You will get the same OLS results from the statsmodels formula interface as you would from sklearn.linear_model.LinearRegression, or R, or SAS, or Excel.
import statsmodels.formula.api as smf
smod = smf.ols(formula='y ~ x', data=df)
result = smod.fit()
print(result.summary())
When in doubt, please
try reading the source code
try a different language for benchmark, or
try OLS from scratch, which is basic linear algebra (see the sketch below).
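For the last point, a minimal from-scratch sketch using the normal equations (this assumes X already contains a column of ones for the intercept):
import numpy as np
def ols_from_scratch(X, y):
    # solve the normal equations (X'X) beta = X'y of least squares
    return np.linalg.solve(X.T @ X, X.T @ y)
# should agree with sm.OLS(y, X).fit().params and with
# LinearRegression(fit_intercept=False).fit(X, y).coef_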
I just wanted to add here that sklearn does not use the OLS method for linear regression under the hood. Since sklearn comes from the data-mining/machine-learning realm, they like to use gradient descent algorithms. These are numerical methods that are sensitive to initial conditions and the like, while OLS is an analytical closed-form approach, so one should expect differences. statsmodels comes from the classical statistics field, hence it uses the OLS technique. So there are differences between the two linear regressions from the two different libraries.
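For what it's worth, if you want to see the numerical-versus-analytical contrast described above, one way is to compare sklearn's SGDRegressor (which does fit by stochastic gradient descent) against a closed-form fit; a sketch on toy data:
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import SGDRegressor
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = 1 + 0.1 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(200)
# iterative, gradient-based fit; sensitive to scaling, iteration count, seed,
# and it carries a small default l2 penalty, so expect small differences
sgd = SGDRegressor(max_iter=5000, random_state=0).fit(X, y)
print(sgd.intercept_, sgd.coef_)
# closed-form OLS fit for comparison
print(sm.OLS(y, sm.add_constant(X)).fit().params)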
Related
Problem statement: I'm working with a linear system of equations that correspond to an inverse problem that is ill-posed. I can apply Tikhonov regularization or ridge regression by hand in Python, and get solutions on test data that are sufficiently accurate for my problem. I'd like to try solving this problem using sklearn.linear_model.Ridge, because I'd like to try other machine-learning methods in the linear models part of that package (https://scikit-learn.org/stable/modules/linear_model.html). I'd like to know if using sklearn in this context is using the wrong tool.
What I've done: I read the documentation for sklearn.linear_model.Ridge. Since I know the linear transformation corresponding to the forward problem, I have run it over impulse responses to create training data, and then supplied it to sklearn.linear_model.Ridge to generate a model. Unlike when I apply the equation for ridge regression myself in Python, the model from sklearn.linear_model.Ridge only works on impulse responses. On the other hand, applying ridge regression using the equations myself, generates a model that can be applied to any linear combination of the impulse responses.
Is there a way to apply the linear methods of sklearn, without needing to generate a large test data set that represents the entire parameter space of the problem, or is this requisite for using (even linear) machine learning algorithms?
Should sklearn.linear_model.Ridge return the same results as solving the equation for ridge regression, when the sklearn method is applied to test cases that span the forward problem?
Many thanks to anyone who can help my understanding.
Found the answer through trial and error. Answering my own question in case anyone was thinking like I did and needs clarity.
Yes, if you use training data that spans the problem space, it is the same as running ridge regression in python using the equations. sklearn does what it says in the documentation.
You need to use fit_intercept=True to get sklearn.linear_model.Ridge to fit the Y intercept of your problem, otherwise it is assumed to be zero.
If you use fit_intercept=False and your problem does NOT have a Y-intercept of zero, you will, of course, get a bad solution.
This might lead a novice like me to the impression that you haven't supplied enough training data, which is incorrect.
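A minimal sketch of the kind of check involved, with made-up names and an illustrative regularization strength; with fit_intercept=False sklearn's Ridge matches the hand-rolled Tikhonov solution, and for data with a nonzero offset you would want fit_intercept=True as noted above:
import numpy as np
from sklearn.linear_model import Ridge
rng = np.random.default_rng(0)
A = rng.random((50, 10))          # forward operator / training inputs
x_true = rng.random(10)
b = A @ x_true + 0.01 * rng.standard_normal(50)
alpha = 1e-3
# closed-form ridge (Tikhonov) solution with no intercept term
x_hand = np.linalg.solve(A.T @ A + alpha * np.eye(10), A.T @ b)
# sklearn Ridge with fit_intercept=False solves the same problem
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(A, b)
print(np.allclose(x_hand, ridge.coef_))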
While using the sklearn LinearRegression library, given that we split the data using train_test_split, do we have to use only the training data for OLS (the least-squares fit), or can we use the full data for the OLS fit and then deduce the regression result?
There are many mistakes that beginner data scientists make, and one of them is to let the test data take any part in the learning process; the usual train/test split diagram illustrates this.
As you can see there, the test data is kept separate during the training process, and it is important that it stays that way.
Now, to your question about the least-squares fit: while you may think that using the full data improves the process, you are forgetting about the evaluation part, which would then look better not because the regression is better, but simply because you have already shown the model the data you are testing it with.
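In sklearn terms this means fitting the least-squares model on the training split only and reserving the test split for evaluation; a minimal sketch (X and y are placeholders for your features and target):
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# fit the OLS model on the training data only
model = LinearRegression().fit(X_train, y_train)
# evaluate on data the model has never seen
print(r2_score(y_test, model.predict(X_test)))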
Using two different methods of getting XGBoost feature importance gives me two different "most important" features. Which one should be believed?
Which method should be used when? I am confused.
Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import xgboost as xgb
df = sns.load_dataset('mpg')
df = df.drop(['name','origin'],axis=1)
X = df.iloc[:,1:]
y = df.iloc[:,0]
Numpy arrays
# fit the model
model_xgb_numpy = xgb.XGBRegressor(n_jobs=-1,objective='reg:squarederror')
model_xgb_numpy.fit(X.to_numpy(), y.to_numpy())
plt.bar(range(len(model_xgb_numpy.feature_importances_)), model_xgb_numpy.feature_importances_)
Pandas dataframe
# fit the model
model_xgb_pandas = xgb.XGBRegressor(n_jobs=-1,objective='reg:squarederror')
model_xgb_pandas.fit(X, y)
axsub = xgb.plot_importance(model_xgb_pandas)
Problem
The NumPy method shows the 0th feature (cylinders) as most important. The pandas method shows model year as most important. Which one is the CORRECT most important feature?
References
How to get feature importance in xgboost?
Feature importance 'gain' in XGBoost
It is hard to define THE correct feature importance measure. Each has pros and cons. It is a wide topic with no golden rule as of now and I personally would suggest to read this online book by Christoph Molnar: https://christophm.github.io/interpretable-ml-book/. The book has an excellent overview of different measures and different algorithms.
As a rule of thumb, if you cannot use an external package, I would choose gain, as it is more representative of what one is interested in (one typically is not interested in the raw number of splits on a particular feature, but rather in how much those splits helped); see this question for a good summary: https://datascience.stackexchange.com/q/12318/53060. If you can use other tools, shap exhibits very good behaviour and I would always choose it over the built-in xgb tree measures, unless computation time is strongly constrained.
As for the difference that you directly pointed at in your question, the root of it is that xgb.plot_importance uses weight as the default importance type, while the XGBModel itself uses gain as the default. If you configure them to use the same importance type, then you will get similar distributions (up to the additional normalisation in feature_importances_ and the sorting in plot_importance).
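A sketch of that alignment, continuing the setup from the question; importance_type is accepted both by the XGBRegressor constructor and by plot_importance:
# make both sides report 'gain'; after this, only the extra normalisation
# in feature_importances_ and the sorting in plot_importance differ
model_gain = xgb.XGBRegressor(n_jobs=-1, objective='reg:squarederror',
                              importance_type='gain')
model_gain.fit(X, y)
print(dict(zip(X.columns, model_gain.feature_importances_)))
xgb.plot_importance(model_gain, importance_type='gain')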
There are 3 ways to get feature importance from Xgboost:
use built-in feature importance (I prefer gain type),
use permutation-based feature importance
use SHAP values to compute feature importance
In my post I wrote code examples for all 3 methods. Personally, I'm using permutation-based feature importance. In my opinion, the built-in feature importance can show features as important after overfitting to the data (this is just an opinion based on my experience). SHAP explanations are fantastic, but sometimes computing them can be time-consuming (and you need to downsample your data).
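As an illustration of the permutation-based option, a minimal sketch using sklearn's permutation_importance on the mpg setup from the question (the split parameters are just an example):
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = xgb.XGBRegressor(objective='reg:squarederror').fit(X_train, y_train)
# permutation importance is computed on held-out data, which makes it less
# prone to rewarding features the model merely overfit to
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(dict(zip(X_test.columns, result.importances_mean)))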
From the answer here, which gives a neat explanation:
feature_importances_ returns weights - what we usually think of as "importance".
plot_importance returns the number of occurrences in splits.
Note: I think that the selected answer above does not actually cover the point.
I have strain and temperature data, and I have read this article:
https://www.idtools.com.au/principal-component-regression-python-2/
I'm trying to build a model and predict the strain out of the temperature.
I got the following results, where the cross-validation score is negative.
I have the data set here
http://www.mediafire.com/file/r7dg7i9dacvpl2j/curve_fitting_ahmed.xlsx/file
My question is: do these cross-validation results make sense?
My code is the following.
The input is a pandas DataFrame.
def pca_analysis(temperature, strain):
# Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score, mean_squared_error
# Import Data
print("process data")
T1 = temperature['T1'].tolist()
W_A1 = strain[0]
N = len(T1)
xData = np.reshape(T1, (N, 1))
yData = np.reshape(W_A1, (N, 1))
# Define the PCA object
pca = PCA()
Xstd = StandardScaler().fit_transform(xData)
# Run PCA and keep the first principal components as the regressors Xreg
Xreg = pca.fit_transform(Xstd)[:, :2]
''' Step 2: regression on selected principal components'''
# Create linear regression object
regr = linear_model.LinearRegression()
# Fit
regr.fit(Xreg,W_A1)
# Calibration
y_c = regr.predict(Xreg)
# Cross-validation
y_cv = cross_val_predict(regr, Xreg, W_A1, cv=10)
# Calculate scores for calibration and cross-validation
score_c = r2_score(W_A1, y_c)
score_cv = r2_score(W_A1, y_cv)
# Calculate mean square error for calibration and cross validation
mse_c = mean_squared_error(W_A1, y_c)
mse_cv = mean_squared_error(W_A1, y_cv)
print(mse_c)
print(mse_cv)
print(score_c)
print(score_cv)
# Regression plot
z = np.polyfit(W_A1, y_c, 1)
with plt.style.context(('ggplot')):
fig, ax = plt.subplots(figsize=(9, 5))
ax.scatter(W_A1, y_c, c='red', s = 0.4, edgecolors='k')
ax.plot(W_A1, z[1] + z[0] * yData, c='blue', linewidth=1)
ax.plot(W_A1, W_A1, color='green', linewidth=1)
plt.title('$R^{2}$ (CV): ' + str(score_cv))
plt.xlabel('Measured $^{\circ}$Strain')
plt.ylabel('Predicted $^{\circ}$Strain')
plt.show()
Here is the result of PCR
How would I improve the prediction using that data?
From the scikit-learn documentation, the value given by r2_score can be negative, because a model can be arbitrarily worse than the trivial baseline that always predicts the mean of the data. Obviously this is not what one wants from using ML; you expect to do better than that baseline.
The first thing I would note is that your data seems like it may be quite nonlinear, in which case PCA struggles to improve model performance.
One potential substitute for PCA which accounts for essentially any nonlinearities in data is the use of autoencoders to preprocess the data (good article on these here). They can account for nonlinearities if you use non-linear activation functions on some of the hidden layers of the autoencoder, which may help your model's performance. There are many articles around the web that explain this; let me know if you want some resources, should you choose to pursue this course.
The next thing that I would note is that r2_score is really not the best way to measure error, and that using mean-squared error is much more popular, especially for linear regression. So, if you want to keep your model as simple as this, I would simply ignore the r2_score and move on from there. However, that being said, linear regression is not equipped to solve very complex problems due to its simplicity, and judging by the picture you provided, it's pretty clear to me that linear regression is very rough when applied to this dataset.
I would be interested to know the difference in mean-squared error between the PCA and non-PCA applied data. Here, the PCA-transformed data should have less error than the normal, non-PCA data. If it does not, then either your data is horribly non-linear (maybe?) or there is an error in your code (I looked over it and nothing is immediately obviously wrong with it). For linear regression, mean-squared error is really almost the unanimous error function of choice, and is remarkably effective. Hope this answers your question; leave a comment/question about my answer if you have one and I will try to clarify as best as I can.
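A sketch of that comparison, reusing W_A1, Xstd and Xreg from your function and the same 10-fold cross-validated predictions:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict
regr = linear_model.LinearRegression()
# cross-validated MSE on the PCA scores versus the standardized raw input
mse_pca = mean_squared_error(W_A1, cross_val_predict(regr, Xreg, W_A1, cv=10))
mse_raw = mean_squared_error(W_A1, cross_val_predict(regr, Xstd, W_A1, cv=10))
print(mse_pca, mse_raw)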
Also, while answering your question, I came across this other question that I believe explains your problem pretty well (and uses some math, so be prepared). Most notably, there are situations where R^2 is an appropriate error measure for your model, but given your results, I would say that R^2 would probably be a poor choice of error function for this data.
Update: Given the values that you get for the mean squared error, my first guess would be that PCA is 1) not working because of the nature of the data, or 2) implemented incorrectly. While I am not an expert with the libraries you are using, I would make sure that you transform all of the data in the same way, i.e. make sure that the PCA-transformed vectors are being compared with transformed vectors.
For moving on from linear regression, I would look into making a simple neural network or an SVR (this might be a little trickier). Both of these methods are proven to work well for complex data and are very adaptable. There are tons of resources online for both of these things, and I think giving specifics on the implementation of either of them might be out of the scope of this question (you might have to ask a separate one about this).
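If you do want to try SVR, a minimal sketch using the xData and W_A1 arrays from your function; the kernel and hyperparameters here are just starting points, not tuned values:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
# an RBF-kernel SVR can capture nonlinearity that a straight line misses;
# scaling the inputs first matters a lot for SVR
svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10.0, epsilon=0.1))
print(cross_val_score(svr, xData, W_A1, cv=10, scoring='neg_mean_squared_error'))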
I am running some algorithms in scikit-learn. Currently I use RandomizedLasso, but this question pertains to any ML algorithm in scikit-learn.
My initial training data is 149x56. Now here is what I do:
from sklearn.linear_model import RandomizedLasso
est_rlasso = RandomizedLasso(max_iter=1000)
# Running Randomised Lasso
x = est_rlasso.fit_transform(tourism_X, tourism_Y)
x.shape
>>> (149, 36)
So, as you can see, it selects the 36 best features to retain out of the initial 56 and transforms the dataset from 149x56 to 149x36. But which 36 features did it retain? The biggest problem with scikit-learn is that it strips off the variable headers, so now I am left clueless about which features the algorithm kept and which it removed, as the final X has no header to cross-check against.
This is common across ML algorithm implementations in scikit-learn. How does one overcome it? For example, if I need to find which variables it marked as significant, or if I am running a regression model, which variables do the coefficients stand for? I might have used OneHotEncoder to transform categorical variables, which would change the variable order from the original.
Any idea?
From the docs:
get_support([indices]): Return a mask, or list, of the features/indices selected.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RandomizedLasso.html
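A minimal sketch of putting get_support together with the original column names (this assumes tourism_X is a pandas DataFrame, so it still carries its headers):
import numpy as np
# boolean mask over the original 56 columns
mask = est_rlasso.get_support()
# names of the 36 features that RandomizedLasso retained
selected = np.asarray(tourism_X.columns)[mask]
print(selected)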