Python - Multiple Linear Regression - Coefficient of Determination for each Input Variable

I am performing a fairly straightforward multiple linear regression in Python using sklearn. See the code snippet below - full_results is a dataframe in which all variables are numeric.
The result of this code is a single coefficient of determination, which I believe denotes how much of the change in y is due to the combination of x1 - x4.
My question is whether the coefficient of determination can be split out between the 4 input variables, so I can see how much change in y is attributed to each variable individually.
I can of course run a single variable linear regression for each variable independently, but this doesn't feel like the right solution.
I have a memory of being in stats class many years ago and doing something similar in R.
from sklearn.linear_model import LinearRegression
x = full_results[['x1','x2','x3','x4']].values
y = full_results['y'].values
mlr = LinearRegression()
mlr.fit(x, y)
mlr.score(x, y)

The coefficient of determination is the proportion of total variance explained. So another way of looking at it is to compute the proportion of variance explained by each term, as also explained here. For this we use an ANOVA to calculate the sum of squares attributed to each term.
One thing to take note of is that this works cleanly only if your predictors are not correlated. If they are, the order in which they are specified in the model makes a difference in the calculation (see the sketch at the end of this answer).
Using an example dataset:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
import pandas as pd
X, y = make_regression(n_samples=100, n_features=4,
                       n_informative=3, noise=20, random_state=99)
df = pd.DataFrame(X,columns = ['x1','x2','x3','x4'])
df['y'] = y
mlr = LinearRegression()
mlr.fit(df[['x1','x2','x3','x4']], y)
mlr.coef_
array([ 8.33369861, 29.1717497 , 26.6294007 , -1.82445836])
mlr.score(df[['x1','x2','x3','x4']], y)
0.8465893941639528
It's easier to calculate this with statsmodels. Making the same linear fit, you can see that the coefficients are pretty similar:
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
lm = ols('y ~ x1 + x2 + x3 + x4',df).fit()
lm.params
Intercept -0.740399
x1 8.333699
x2 29.171750
x3 26.629401
x4 -1.824458
We get the ANOVA table:
anova_table = anova_lm(lm)
anova_table
             df         sum_sq        mean_sq           F        PR(>F)
x1          1.0   10394.554366   10394.554366   28.605241  6.110239e-07
x2          1.0  113541.846572  113541.846572  312.460911  8.531356e-32
x3          1.0   66267.787822   66267.787822  182.365304  7.899193e-24
x4          1.0     298.584632     298.584632    0.821688  3.669804e-01
Residual   95.0   34521.039456     363.379363         NaN           NaN
Summing the sum_sq column for everything except the residuals, and dividing by the total sum of squares, gives you an r-squared that matches the one from sklearn:
anova_table['sum_sq'][:-1].sum() / anova_table['sum_sq'].sum()
0.8465893941639528
Now the proportion of variance explained by, for example, 'x1' (for an individual term we usually don't call this r-squared) is:
anova_table.loc['x1','sum_sq'] / anova_table['sum_sq'].sum()
0.046193130558342954
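To illustrate the order dependence mentioned above, here is a minimal sketch (not part of the original answer) that builds two correlated predictors and runs the same model with the terms in both orders; the sequential (Type I) sums of squares, and hence the per-variable proportions, change with the order, while the total r-squared stays the same:
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.RandomState(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=n)   # x2 strongly correlated with x1
y = 2 * x1 + 3 * x2 + rng.normal(size=n)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})

# Same model, two different term orders
lm_12 = ols('y ~ x1 + x2', df).fit()
lm_21 = ols('y ~ x2 + x1', df).fit()

# The sequential sums of squares (and thus the per-variable shares) differ ...
print(anova_lm(lm_12)['sum_sq'])
print(anova_lm(lm_21)['sum_sq'])
# ... but the total r-squared is identical.
print(lm_12.rsquared, lm_21.rsquared)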

Related

Polynomial transforms in regression result in multiple p-values for each variable. Is there a single, proxy value that can represent these p-values?

I use Python's statsmodels.formula.api for most regression tasks. When I test a large number of variables in a model, I check their p-values to be confident that the variables are actually improving the model.
I usually apply polynomial transformations to variables to test whether that improves the fit. For example, this is the equation for a polynomial transform of 3 degrees to variable x:
ŷ = C + β₁x + β₂x² + β₃x³
My problem is that each term of the variable x has its own p-value. That is, there's a separate p-value for the x, x² and x³ terms.
In a multivariate regression model, where the predictor matrix X comprises n individual variables x₁, x₂, ... xₙ, I have no way to assess the p-value of xᵢ if I applied a polynomial transform to it.
In this code, I apply a 3-degree polynomial transform to X:
import statsmodels.formula.api as smf
X = data["predictor"]
y = data["outcome"]
model = smf.ols(
    formula="outcome ~ predictor + I(predictor**2) + I(predictor**3)",
    data=data
)
results = model.fit()
print(results.pvalues)
This will yield 4 p-values: one for the constant/intercept, and one for each term of X. Can I somehow combine the p-values of X, or otherwise develop a proxy measure for them that is a single value?
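A common way to get a single p-value for all polynomial terms of one variable is a model-comparison F-test: fit the model with and without those terms and compare the two fits. Here is a minimal sketch with synthetic data standing in for the question's data frame (the columns predictor and outcome are assumptions taken from the question's code):
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic stand-in for the question's `data` frame
rng = np.random.RandomState(0)
predictor = rng.uniform(-2, 2, size=200)
data = pd.DataFrame({
    "predictor": predictor,
    "outcome": 1.0 + 0.5 * predictor - 0.8 * predictor**2 + rng.normal(size=200),
})

full = smf.ols(
    "outcome ~ predictor + I(predictor**2) + I(predictor**3)", data=data
).fit()
reduced = smf.ols("outcome ~ 1", data=data).fit()

# One F statistic (and one p-value) for the joint contribution of all predictor terms
print(anova_lm(reduced, full))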

Shap statistics

I used shap to determine the feature importance for multiple regression with correlated features.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
import shap
boston = load_boston()
regr = pd.DataFrame(boston.data)
regr.columns = boston.feature_names
regr['MEDV'] = boston.target
X = regr.drop('MEDV', axis = 1)
Y = regr['MEDV']
fit = LinearRegression().fit(X, Y)
explainer = shap.LinearExplainer(fit, X, feature_dependence = 'independent')
# I used 'independent' because the result is consistent with the ordinary
# Shapley values, whereas 'correlated' is not
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, plot_type = 'bar')
shap offers a chart to get the shap values. Is there also a statistic available? I am interested in the exact shap values. I read the GitHub repository and the documentation but found nothing regarding this topic.
When we look at shap_values we see that it contains some positive and negative numbers, and its dimensions equal the dimensions of the Boston dataset. Linear regression is an ML algorithm which calculates the optimal y = wx + b, where y is MEDV, x is the feature vector and w is the vector of weights. In my opinion, shap_values stores the per-feature contributions: a matrix in which each feature value (relative to its mean) is multiplied by the corresponding weight calculated by the linear regression.
So to calculate the statistic you want, I first took absolute values and then averaged over the samples (the order of these two steps matters!). Next I attached the original column names and sorted from biggest effect to smallest. With this, I hope I have answered your question! :)
from matplotlib import pyplot as plt

# retain only the size of each effect
shap_values_abs = np.absolute(shap_values)
# mean absolute SHAP value per feature
means = shap_values_abs.mean(axis=0)
# sort values and names from smallest to biggest effect
idx = np.argsort(means)
means = np.array(means)[idx]
names = np.array(boston.feature_names)[idx]
# plot
plt.figure(figsize=(10, 10))
plt.barh(names, means)
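If you want the numbers themselves rather than just the bar chart, the same mean absolute SHAP values can be put into an explicit table (a small follow-up sketch, continuing from the code above; not part of the original answer):
import numpy as np
import pandas as pd

# Mean absolute SHAP value per feature, sorted from biggest effect to smallest
importance = pd.Series(np.abs(shap_values).mean(axis=0),
                       index=boston.feature_names).sort_values(ascending=False)
print(importance)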

Kernel ridge and simple Ridge with Polynomial features

What is the difference between Kernel Ridge (from sklearn.kernel_ridge) with polynomial kernel and using PolynomialFeatures + Ridge (from sklearn.linear_model)?
The difference is in feature computation. PolynomialFeatures explicitly computes polynomial combinations between the input features up to the desired degree while KernelRidge(kernel='poly') only considers a polynomial kernel (a polynomial representation of feature dot products) which will be expressed in terms of the original features. This document provides a good overview in general.
Regarding the computation we can inspect the relevant parts from the source code:
Ridge Regression
The actual computation starts here (for the default settings); you can compare with equation (5) in the above linked document. The computation involves computing the dot product between feature vectors (the kernel), then the dual coefficients (alpha) and finally a dot product with the feature vectors in order to obtain the weights.
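As a sanity check on that description, here is a small sketch (synthetic data, not taken from the linked source) showing that the dual route - kernel, then dual coefficients, then a dot product with the feature vectors - gives the same weights as the usual closed-form ridge solution and as sklearn's Ridge:
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = rng.randn(20)
alpha = 1.0

# Kernel (Gram matrix of the samples), then the dual coefficients, then the weights
K = X @ X.T
dual_coef = np.linalg.solve(K + alpha * np.eye(20), y)
w_dual = X.T @ dual_coef

# Closed-form primal solution for comparison
w_primal = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print(np.allclose(w_dual, w_primal), np.allclose(w_dual, ridge.coef_))  # True True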
Kernel Ridge
KernelRidge similarly computes the dual coefficients but stores them (instead of going on to compute weights). This is because when making predictions the kernel between training and prediction samples is computed again; the result is then dotted with the dual coefficients.
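The same can be verified for KernelRidge with a rough sketch (again synthetic data, not from the source): recompute the polynomial kernel between prediction and training samples and dot it with the stored dual coefficients:
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.RandomState(0)
X = rng.randn(30, 2)
y = rng.randn(30)

krr = KernelRidge(alpha=1.0, kernel='poly', degree=2, gamma=1, coef0=1).fit(X, y)

X_new = rng.randn(5, 2)
# Kernel between prediction and training samples, dotted with the dual coefficients
K_new = polynomial_kernel(X_new, X, degree=2, gamma=1, coef0=1)
print(np.allclose(K_new @ krr.dual_coef_, krr.predict(X_new)))  # True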
The computation of the (training) kernel follows a similar procedure: compare Ridge and KernelRidge. The major difference is that Ridge explicitly considers the dot product between whatever (polynomial) features it has received while for KernelRidge these polynomial features are generated implicitly during the computation. For example consider a single feature x; with gamma = coef0 = 1 the KernelRidge computes (x**2 + 1)**2 == (x**4 + 2*x**2 + 1). If you consider now PolynomialFeatures this will provide features x**2, x, 1 and the corresponding dot product is x**4 + x**2 + 1. Hence the dot product differs by a term x**2. Of course we could rescale the poly-features to have x**2, sqrt(2)*x, 1 while with KernelRidge(kernel='poly') we don't have this kind of flexibility. On the other hand the difference probably doesn't matter (in most cases).
Note that also the computation of the dual coefficients is performed in a similar manner: Ridge and KernelRidge. Finally KernelRidge keeps the dual coefficients while Ridge directly computes the weights.
Let's see a small example:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.utils.extmath import safe_sparse_dot
np.random.seed(20181001)
a, b = 1, 4
x = np.linspace(0, 2, 100).reshape(-1, 1)
y = a*x**2 + b*x + np.random.normal(scale=0.2, size=(100,1))
poly = PolynomialFeatures(degree=2, include_bias=True)
xp = poly.fit_transform(x)
print('We can see that the new features are now [1, x, x**2]:')
print(f'xp.shape: {xp.shape}')
print(f'xp[-5:]:\n{xp[-5:]}', end='\n\n')
# Scale the `x` columns so we obtain similar results.
xp[:, 1] *= np.sqrt(2)
ridge = Ridge(alpha=0, fit_intercept=False, solver='cholesky')
ridge.fit(xp, y)
krr = KernelRidge(alpha=0, kernel='poly', degree=2, gamma=1, coef0=1)
krr.fit(x, y)
# Let's try to reproduce some of the involved steps for the different models.
ridge_K = safe_sparse_dot(xp, xp.T)
krr_K = krr._get_kernel(x)
print('The computed kernels are (almost) similar:')
print(f'Max. kernel difference: {np.abs(ridge_K - krr_K).max()}', end='\n\n')
print('Predictions slightly differ though:')
print(f'Max. difference: {np.abs(krr.predict(x) - ridge.predict(xp)).max()}', end='\n\n')
# Let's see if the fit changes if we provide `x**2, x, 1` instead of `x**2, sqrt(2)*x, 1`.
xp_2 = xp.copy()
xp_2[:, 1] /= np.sqrt(2)
ridge_2 = Ridge(alpha=0, fit_intercept=False, solver='cholesky')
ridge_2.fit(xp_2, y)
print('Using features "[x**2, x, 1]" instead of "[x**2, sqrt(2)*x, 1]" predictions are (almost) the same:')
print(f'Max. difference: {np.abs(ridge_2.predict(xp_2) - ridge.predict(xp)).max()}', end='\n\n')
print('Interpretability of the coefficients changes though:')
print(f'ridge.coef_[1:]: {ridge.coef_[0, 1:]}, ridge_2.coef_[1:]: {ridge_2.coef_[0, 1:]}')
print(f'ridge.coef_[1]*sqrt(2): {ridge.coef_[0, 1]*np.sqrt(2)}')
print(f'Compare with: a, b = ({a}, {b})')
plt.plot(x.ravel(), y.ravel(), 'o', color='skyblue', label='Data')
plt.plot(x.ravel(), ridge.predict(xp).ravel(), '-', label='Ridge', lw=3)
plt.plot(x.ravel(), krr.predict(x).ravel(), '--', label='KRR', lw=3)
plt.grid()
plt.legend()
plt.show()
From which we obtain:
We can see that the new features are now [1, x, x**2]:
xp.shape: (100, 3)
xp[-5:]:
[[1.         1.91919192 3.68329762]
 [1.         1.93939394 3.76124885]
 [1.         1.95959596 3.84001632]
 [1.         1.97979798 3.91960004]
 [1.         2.         4.        ]]
The computed kernels are (almost) similar:
Max. kernel difference: 1.0658141036401503e-14
Predictions slightly differ though:
Max. difference: 0.04244651134471766
Using features "[x**2, x, 1]" instead of "[x**2, sqrt(2)*x, 1]" predictions are (almost) the same:
Max. difference: 7.15642822779472e-14
Interpretability of the coefficients changes though:
ridge.coef_[1:]: [2.73232239 1.08868872], ridge_2.coef_[1:]: [3.86408737 1.08868872]
ridge.coef_[1]*sqrt(2): 3.86408737392841
Compare with: a, b = (1, 4)
Here is an example to show it:
import matplotlib.pyplot as plt
from sklearn.datasets import make_friedman1
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

plt.figure()
plt.title('Complex regression problem with one input variable')
X_F1, y_F1 = make_friedman1(n_samples=100,
                            n_features=7, random_state=0)

print('\nNow we transform the original input data to add\n\
polynomial features up to degree 2 (quadratic)\n')
poly = PolynomialFeatures(degree=2)
X_F1_poly = poly.fit_transform(X_F1)
X_train, X_test, y_train, y_test = train_test_split(X_F1_poly, y_F1,
                                                    random_state=0)
linreg = Ridge().fit(X_train, y_train)
print('(poly deg 2 + ridge) linear model coeff (w):\n{}'
      .format(linreg.coef_))
print('(poly deg 2 + ridge) linear model intercept (b): {:.3f}'
      .format(linreg.intercept_))
print('(poly deg 2 + ridge) R-squared score (training): {:.3f}'
      .format(linreg.score(X_train, y_train)))
print('(poly deg 2 + ridge) R-squared score (test): {:.3f}'
      .format(linreg.score(X_test, y_test)))
(poly deg 2 + ridge) linear model coeff (w):
[ 0. 2.23 4.73 -3.15 3.86 1.61 -0.77 -0.15 -1.75 1.6 1.37 2.52
2.72 0.49 -1.94 -1.63 1.51 0.89 0.26 2.05 -1.93 3.62 -0.72 0.63
-3.16 1.29 3.55 1.73 0.94 -0.51 1.7 -1.98 1.81 -0.22 2.88 -0.89]
(poly deg 2 + ridge) linear model intercept (b): 5.418
(poly deg 2 + ridge) R-squared score (training): 0.826
(poly deg 2 + ridge) R-squared score (test): 0.825
I assume you already know how kernel ridge regression (KRR) and PolynomialFeatures + Ridge work. They are essentially the same. Below are some minor differences between them.
You can switch off the bias feature in PolynomialFeatures and include the intercept in the Ridge instead; the regularization term of Ridge does not include the bias. In contrast, for sklearn's KRR the penalty term always includes the bias term.
You can scale the features generated by PolynomialFeatures before you use Ridge, which amounts to customizing the regularization strength for each polynomial feature, so PolynomialFeatures + Ridge is a little more flexible. In contrast, you have only two parameters to tune in the polynomial kernel, namely gamma and c_0; see polynomial kernel.
The fit and prediction times are different. In KRR you have to solve a linear system built on the N×N kernel matrix K, whereas with PolynomialFeatures + Ridge you only have to solve a system built on the N×(D+1) design matrix A, i.e. with D+1 unknowns, where N is the number of training samples and D the degree of the polynomial (see the sketch after this list).
(This is a very rare corner case.) The kernel matrix will be (almost) singular if two samples are (nearly) identical, and when alpha (the regularization strength) is very small you will run into numerical stability problems, since K + alpha*I is then almost singular. You can overcome this problem by using Ridge on explicit polynomial features instead. The reason why Ridge still works is explained in many machine learning textbooks.
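To make the size difference in the point about fit and prediction times concrete, here is a tiny sketch (assuming a single input feature and degree 3) comparing the N×N kernel matrix KRR works with against the N×(D+1) design matrix used by PolynomialFeatures + Ridge:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.RandomState(0)
N, D = 500, 3                     # 500 training samples, polynomial degree 3
x = rng.randn(N, 1)

# KRR solves a system built on the N x N kernel matrix ...
K = polynomial_kernel(x, degree=D, gamma=1, coef0=1)
# ... while PolynomialFeatures + Ridge works with an N x (D+1) design matrix.
A = PolynomialFeatures(degree=D, include_bias=True).fit_transform(x)
print(K.shape, A.shape)           # (500, 500) (500, 4)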

how to get standardised (Beta) coefficients for multiple linear regression using statsmodels

When using the .summary() function from statsmodels on a pandas DataFrame, the OLS Regression Results include the following fields:
coef std err t P>|t| [0.025 0.975]
How can I get the standardised coefficients (which exclude the intercept), similarly to what is achievable in SPSS?
You just need to standardize your original DataFrame using z-scores first and then perform a linear regression.
Assume your dataframe is named df and has independent variables x1, x2, and x3, and dependent variable y. Consider the following code:
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.formula.api as smf
# standardizing dataframe
df_z = df.select_dtypes(include=[np.number]).dropna().apply(stats.zscore)
# fitting regression
formula = 'y ~ x1 + x2 + x3'
result = smf.ols(formula, data=df_z).fit()
# checking results
result.summary()
Now, the coef will show you the standardized (beta) coefficients so that you can compare their influence on your dependent variable.
Notes:
Please keep in mind that you need .dropna(). Otherwise, stats.zscore will return all NaN for a column if it has any missing values.
Instead of using .select_dtypes(), you can select columns manually, but make sure all the columns you select are numeric.
If you only care about the standardized (beta) coefficients, you can also use result.params to return just those. They will usually be displayed in scientific notation; you can use something like round(result.params, 5) to round them.
We can just transform the estimated params by the standard deviation of the exog. results.t_test(transformation) computes the parameter table for the linearly transformed variables.
AFAIR, the following should produce the beta coefficients and corresponding inferential statistics.
Compute standard deviation, but set it to 1 for the constant.
std = model.exog.std(0)
std[0] = 1
Then use results.t_test and look at the params_table. np.diag(std) creates a diagonal matrix that transforms the params.
tt = results.t_test(np.diag(std))
print(tt.summary())
tt.summary_frame()
You can convert unstandardized coefficients to standardized ones using the standard deviations. The standardized coefficient (beta) is what is needed for driver analysis. Below is the code that works for me.
X holds the independent variables, y is the dependent variable, and the coefficients are extracted from the fitted ols model via model.params (indexed by variable name).
sd_x = X.std()
sd_y = y.std()
coefficients = model.params

beta_coefficients = []
# Iterate through independent variables and calculate beta coefficients
for col in X.columns:
    beta = coefficients[col] * (sd_x[col] / sd_y)
    beta_coefficients.append([col, beta])

# Print beta coefficients
for var, beta in beta_coefficients:
    print(f' {var}: {beta}')
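As a quick cross-check, here is a sketch on synthetic data (not from the original answers) showing that rescaling the raw slopes by sd(x)/sd(y), as above, gives exactly the coefficients obtained by regressing z-scored variables:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

# Synthetic data for illustration only
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(200, 3), columns=['x1', 'x2', 'x3'])
df['y'] = 2 * df['x1'] - 1.5 * df['x2'] + rng.randn(200)

# Approach 1: z-score everything, then fit
df_z = df.apply(stats.zscore)
betas_z = smf.ols('y ~ x1 + x2 + x3', data=df_z).fit().params

# Approach 2: fit on the raw data, then rescale the slopes by sd(x) / sd(y)
fit = smf.ols('y ~ x1 + x2 + x3', data=df).fit()
betas_manual = fit.params[['x1', 'x2', 'x3']] * df[['x1', 'x2', 'x3']].std() / df['y'].std()

print(np.allclose(betas_z[['x1', 'x2', 'x3']], betas_manual))  # True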

Comparing computational and analytic results of linear regression

Consider simple one-feature linear regression, where x holds the features and w the weights.
The best-fit weights for the linear regression model are given by the normal equations:
w = (XᵀX)⁻¹Xᵀy
Now I am comparing the results I get from the scikit-learn regressor with those from the computed w, and there is a significant difference between them.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('Salary_Data.csv')
x = data.iloc[:,[0]].values
y = data.iloc[:,[1]].values
#space
x_t = np.transpose(x)
first_inv = np.matmul(x_t, x)
second = np.matmul(x_t, y)
first = np.linalg.inv(first_inv)
theta = np.matmul(first, second)
y_prad = theta*x
#space
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x, y)
y_prad2 = regressor.predict(x)
#space
plt.scatter(x, y)
plt.plot(x, y_prad , 'red')
plt.plot(x, y_prad2, 'green')
Where am I going wrong here (whether in concepts or in code)?
You are forgetting the intercept term. Add a column of ones to the x matrix using
np.insert(x, 0, 1, axis=1)
and then re-run the calculations. The shape of x should then be (30, 2), where the first column is all 1's (this column gets multiplied by the intercept). The final shape of theta should be (2, 1), where the first entry is the intercept and the second is the slope.
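For reference, here is a sketch of the fixed computation on synthetic data (Salary_Data.csv is not reproduced here), showing that the normal equations with the added column of ones match sklearn's intercept and slope:
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the salary data: one feature, one target
rng = np.random.RandomState(0)
x = rng.uniform(1, 10, size=(30, 1))
y = 9000 * x + 25000 + rng.normal(scale=5000, size=(30, 1))

# Add the column of ones so the intercept is estimated too
x_aug = np.insert(x, 0, 1, axis=1)                    # shape (30, 2)
theta = np.linalg.inv(x_aug.T @ x_aug) @ x_aug.T @ y  # shape (2, 1)

regressor = LinearRegression().fit(x, y)
print(theta.ravel())                          # [intercept, slope]
print(regressor.intercept_, regressor.coef_)  # should match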
Here is a good reference for matrix formulation of linear regression.
Matrix Formulation of Linear Regression
