I used shap to determine the feature importance for multiple regression with correlated features.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
import shap
boston = load_boston()
regr = pd.DataFrame(boston.data)
regr.columns = boston.feature_names
regr['MEDV'] = boston.target
X = regr.drop('MEDV', axis = 1)
Y = regr['MEDV']
fit = LinearRegression().fit(X, Y)
explainer = shap.LinearExplainer(fit, X, feature_dependence = 'independent')
# I used 'independent' because the result is consistent with the ordinary
# Shapley values, whereas 'correlated' is not
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, plot_type = 'bar')
shap offers a chart of the SHAP values. Is there also a statistic available? I am interested in the exact SHAP values. I read the GitHub repository and the documentation but found nothing on this topic.
When we look at shap_values we see that it contains positive and negative numbers, and that its dimensions equal the dimensions of the Boston dataset. Linear regression is an ML algorithm that finds the optimal y = wx + b, where y is MEDV, x is the feature vector and w is the vector of weights. In my opinion, shap_values essentially stores wx: a matrix with the value of each feature multiplied by the corresponding weight calculated by the linear regression (more precisely, shap's LinearExplainer centres each feature at its mean, so the entries are w_i * (x_i - mean(x_i))).
So to calculate the wanted statistic, I first took absolute values and then averaged over them. The order is important! Next I used the initial column names and sorted from the biggest effect to the smallest. With this, I hope I have answered your question! :)
from matplotlib import pyplot as plt
# keep only the size of the effect
shap_values_abs = np.absolute(shap_values)
# mean absolute SHAP value per feature
means_abs = shap_values_abs.mean(axis=0)
# sort values and names
idx = np.argsort(means_abs)
means = means_abs[idx]
names = np.array(boston.feature_names)[idx]
# plot
plt.figure(figsize=(10, 10))
plt.barh(names, means)
plt.show()
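If you want the exact numbers rather than a plot, a small sketch (reusing the shap_values and X from the question) collects the mean absolute SHAP value per feature into a pandas Series:
import pandas as pd

# Mean absolute SHAP value per feature, sorted from largest to smallest effect.
mean_abs_shap = pd.Series(
    np.abs(shap_values).mean(axis=0),
    index=X.columns,
).sort_values(ascending=False)
print(mean_abs_shap)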
I tried to run a ridge regression on the Boston housing data with Python, but I have a couple of questions that I cannot find an answer to anywhere, so I decided to post them here:
Is scaling recommended before fitting the model? I ask because I get the same score whether I scale or not. Also, what is the interpretation of the alpha/coefficient graph in terms of choosing the best alpha?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn import linear_model
df = pd.read_csv('../housing.data',delim_whitespace=True,header=None)
col_names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
df.columns = col_names
X = df.loc[:,df.columns!='MEDV']
col_X = X.columns
y = df['MEDV'].values
# Feature Scaling:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
clf = Ridge()
coefs = []
alphas = np.logspace(-6, 6, 200)
for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X_std, y)
    coefs.append(clf.coef_)
plt.figure(figsize=(20, 6))
plt.subplot(121)
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()
Alpha/coefficient graph for scaled X
Alpha/coefficient graph for unscaled X
On the scaled data, when I compute the score and choose the alpha via cross-validation, I get:
from sklearn.linear_model import RidgeCV
clf = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1, 5, 7]).fit(X_std, y)
> clf.score(X_std, y)
> 0.74038
> clf.alpha_
> 5.0
On the non-scaled data, I even get a slightly better score with a completely different alpha:
clf = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1, 6]).fit(X, y)
> clf.score(X, y)
> 0.74064
> clf.alpha_
> 0.01
Thanks for your insights on the matter, looking forward to reading your answers!
I think you should scale, because ridge regularization penalizes large coefficients, and you don't want to lose meaningful features because of scaling issues. Perhaps you don't see a difference because the housing data is a toy dataset and is already reasonably well scaled.
A larger alpha means a stronger penalty on large coefficients. The graph shows (even though it has no labelling) that with a stronger alpha you shrink the coefficients towards zero more strongly. The more gradual lines are the smaller weights, so they are affected less, or almost not at all, until alpha becomes sufficiently large. The sharper ones are larger weights, so they shrink towards zero more quickly. Once a coefficient effectively vanishes, that feature disappears from your regression.
For the scaled data the entries of the design matrix are smaller in magnitude, so the fitted coefficients tend to be larger and incur a larger L2 penalty. Keeping that penalty under control requires shrinking the coefficients more, and the only way to do that is to choose a large alpha. That is why, when you scale the data, the optimal alpha comes out much larger.
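If you want to compare the scaled and unscaled settings on the same footing, one option (a sketch of my own, not from the answers above) is to bundle the scaler and RidgeCV into a pipeline, using an alpha grid like the one in the question:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

alphas = [1e-3, 1e-2, 1e-1, 1, 5, 7]

# The scaler travels with the model, so both variants are fitted and scored the same way.
scaled_model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas)).fit(X, y)
unscaled_model = RidgeCV(alphas=alphas).fit(X, y)

print(scaled_model.score(X, y), scaled_model.named_steps['ridgecv'].alpha_)
print(unscaled_model.score(X, y), unscaled_model.alpha_)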
What is the difference between Kernel Ridge (from sklearn.kernel_ridge) with polynomial kernel and using PolynomialFeatures + Ridge (from sklearn.linear_model)?
The difference is in feature computation. PolynomialFeatures explicitly computes polynomial combinations between the input features up to the desired degree while KernelRidge(kernel='poly') only considers a polynomial kernel (a polynomial representation of feature dot products) which will be expressed in terms of the original features. This document provides a good overview in general.
Regarding the computation we can inspect the relevant parts from the source code:
Ridge Regression
The actual computation starts here (for the default settings); you can compare with equation (5) in the above linked document. The computation involves computing the dot product between feature vectors (the kernel), then the dual coefficients (alpha) and finally a dot product with the feature vectors in order to obtain the weights.
Kernel Ridge
Similarly computes the dual coefficients and stores them (instead of computing some weights). This is because when making predictions, again the kernel between training and prediction samples is computed. The result is then dotted with the dual coefficients.
The computation of the (training) kernel follows a similar procedure: compare Ridge and KernelRidge. The major difference is that Ridge explicitly considers the dot product between whatever (polynomial) features it has received while for KernelRidge these polynomial features are generated implicitly during the computation. For example consider a single feature x; with gamma = coef0 = 1 the KernelRidge computes (x**2 + 1)**2 == (x**4 + 2*x**2 + 1). If you consider now PolynomialFeatures this will provide features x**2, x, 1 and the corresponding dot product is x**4 + x**2 + 1. Hence the dot product differs by a term x**2. Of course we could rescale the poly-features to have x**2, sqrt(2)*x, 1 while with KernelRidge(kernel='poly') we don't have this kind of flexibility. On the other hand the difference probably doesn't matter (in most cases).
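A quick numeric check of this claim (just a sketch; the feature value 3.0 and the use of sklearn.metrics.pairwise.polynomial_kernel are my own choices, not part of the answer):
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

x = np.array([[3.0]])  # one sample, one feature

# Polynomial kernel with gamma = coef0 = 1, degree = 2: (x**2 + 1)**2 = 100
k_kernel = polynomial_kernel(x, x, degree=2, gamma=1, coef0=1)

# Explicit PolynomialFeatures [1, x, x**2] and their dot product: 1 + x**2 + x**4 = 91
feats = np.array([[1.0, 3.0, 9.0]])
k_explicit = feats @ feats.T

print(k_kernel, k_explicit)  # the two differ by exactly x**2 = 9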
Note that also the computation of the dual coefficients is performed in a similar manner: Ridge and KernelRidge. Finally KernelRidge keeps the dual coefficients while Ridge directly computes the weights.
Let's see a small example:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.utils.extmath import safe_sparse_dot
np.random.seed(20181001)
a, b = 1, 4
x = np.linspace(0, 2, 100).reshape(-1, 1)
y = a*x**2 + b*x + np.random.normal(scale=0.2, size=(100,1))
poly = PolynomialFeatures(degree=2, include_bias=True)
xp = poly.fit_transform(x)
print('We can see that the new features are now [1, x, x**2]:')
print(f'xp.shape: {xp.shape}')
print(f'xp[-5:]:\n{xp[-5:]}', end='\n\n')
# Scale the `x` columns so we obtain similar results.
xp[:, 1] *= np.sqrt(2)
ridge = Ridge(alpha=0, fit_intercept=False, solver='cholesky')
ridge.fit(xp, y)
krr = KernelRidge(alpha=0, kernel='poly', degree=2, gamma=1, coef0=1)
krr.fit(x, y)
# Let's try to reproduce some of the involved steps for the different models.
ridge_K = safe_sparse_dot(xp, xp.T)
krr_K = krr._get_kernel(x)
print('The computed kernels are (almost) identical:')
print(f'Max. kernel difference: {np.abs(ridge_K - krr_K).max()}', end='\n\n')
print('Predictions slightly differ though:')
print(f'Max. difference: {np.abs(krr.predict(x) - ridge.predict(xp)).max()}', end='\n\n')
# Let's see if the fit changes if we provide `x**2, x, 1` instead of `x**2, sqrt(2)*x, 1`.
xp_2 = xp.copy()
xp_2[:, 1] /= np.sqrt(2)
ridge_2 = Ridge(alpha=0, fit_intercept=False, solver='cholesky')
ridge_2.fit(xp_2, y)
print('Using features "[x**2, x, 1]" instead of "[x**2, sqrt(2)*x, 1]" predictions are (almost) the same:')
print(f'Max. difference: {np.abs(ridge_2.predict(xp_2) - ridge.predict(xp)).max()}', end='\n\n')
print('Interpretability of the coefficients changes though:')
print(f'ridge.coef_[1:]: {ridge.coef_[0, 1:]}, ridge_2.coef_[1:]: {ridge_2.coef_[0, 1:]}')
print(f'ridge.coef_[1]*sqrt(2): {ridge.coef_[0, 1]*np.sqrt(2)}')
print(f'Compare with: a, b = ({a}, {b})')
plt.plot(x.ravel(), y.ravel(), 'o', color='skyblue', label='Data')
plt.plot(x.ravel(), ridge.predict(xp).ravel(), '-', label='Ridge', lw=3)
plt.plot(x.ravel(), krr.predict(x).ravel(), '--', label='KRR', lw=3)
plt.grid()
plt.legend()
plt.show()
From which we obtain:
We can see that the new features are now [1, x, x**2]:
xp.shape: (100, 3)
xp[-5:]:
[[1. 1.91919192 3.68329762]
[1. 1.93939394 3.76124885]
[1. 1.95959596 3.84001632]
[1. 1.97979798 3.91960004]
[1. 2. 4. ]]
The computed kernels are (almost) identical:
Max. kernel difference: 1.0658141036401503e-14
Predictions slightly differ though:
Max. difference: 0.04244651134471766
Using features "[x**2, x, 1]" instead of "[x**2, sqrt(2)*x, 1]" predictions are (almost) the same:
Max. difference: 7.15642822779472e-14
Interpretability of the coefficients changes though:
ridge.coef_[1:]: [2.73232239 1.08868872], ridge_2.coef_[1:]: [3.86408737 1.08868872]
ridge.coef_[1]*sqrt(2): 3.86408737392841
Compare with: a, b = (1, 4)
Here is an example to show it:
import matplotlib.pyplot as plt
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
plt.figure()
plt.title('Complex regression problem with one input variable')
X_F1, y_F1 = make_friedman1(n_samples=100, n_features=7, random_state=0)
print('\nNow we transform the original input data to add\n'
      'polynomial features up to degree 2 (quadratic)\n')
poly = PolynomialFeatures(degree=2)
X_F1_poly = poly.fit_transform(X_F1)
X_train, X_test, y_train, y_test = train_test_split(X_F1_poly, y_F1,
                                                    random_state=0)
linreg = Ridge().fit(X_train, y_train)
print('(poly deg 2 + ridge) linear model coeff (w):\n{}'
.format(linreg.coef_))
print('(poly deg 2 + ridge) linear model intercept (b): {:.3f}'
.format(linreg.intercept_))
print('(poly deg 2 + ridge) R-squared score (training): {:.3f}'
.format(linreg.score(X_train, y_train)))
print('(poly deg 2 + ridge) R-squared score (test): {:.3f}'
.format(linreg.score(X_test, y_test)))
(poly deg 2 + ridge) linear model coeff (w):
[ 0. 2.23 4.73 -3.15 3.86 1.61 -0.77 -0.15 -1.75 1.6 1.37 2.52
2.72 0.49 -1.94 -1.63 1.51 0.89 0.26 2.05 -1.93 3.62 -0.72 0.63
-3.16 1.29 3.55 1.73 0.94 -0.51 1.7 -1.98 1.81 -0.22 2.88 -0.89]
(poly deg 2 + ridge) linear model intercept (b): 5.418
(poly deg 2 + ridge) R-squared score (training): 0.826
(poly deg 2 + ridge) R-squared score (test): 0.825
I assume you already know how kernel ridge regression (KRR) and PolynomialFeatures + Ridge work. They are roughly the same. I will list some minor differences between them.
You can switch off the bias feature in PolynomialFeatures and let Ridge handle it through its intercept; the regularization term of Ridge does not include the intercept. In contrast, for sklearn's KRR the penalty term effectively always includes the bias term (see the sketch after this list).
You can scale the features generated by PolynomialFeatures before you use Ridge; this is equivalent to customizing the regularization strength for each polynomial feature, so PolynomialFeatures + Ridge is a little more flexible. With the polynomial kernel, on the other hand, you only have two parameters to tune, i.e. gamma and coef0 (see the polynomial kernel documentation).
The fit and prediction times are different. In KRR you have to solve the linear system K x = y with the N x N kernel matrix K. With PolynomialFeatures + Ridge you only have to solve a system built from the N x (D + 1) design matrix A, where N is the number of training samples and D is the degree of the polynomial.
(This is a very rare corner case.) The kernel matrix will be (almost) singular if two samples are (nearly) identical, and when alpha (the regularization strength) is very small you will run into numerical stability problems, since K + alpha*I is then almost singular. You can only overcome this problem by using Ridge. The reason why Ridge still works is explained in many machine learning textbooks.
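As a small illustration of the first point, here is a sketch (the data and parameters are arbitrary, not taken from the answer): the pipeline lets Ridge fit an unpenalized intercept, while KernelRidge folds the constant into the penalized kernel solution.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(0, 2, size=(50, 1))
y = 3 + 2 * X.ravel() + X.ravel() ** 2 + rng.normal(scale=0.1, size=50)

# Bias handled by Ridge's intercept, which is not penalized.
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=1.0, fit_intercept=True),
).fit(X, y)

# KernelRidge(kernel='poly') includes the constant through coef0, so the bias is
# effectively part of the penalized solution.
krr = KernelRidge(alpha=1.0, kernel='poly', degree=2, gamma=1, coef0=1).fit(X, y)

print(poly_ridge.predict(X[:3]))
print(krr.predict(X[:3]))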
I'm trying to use linear regression to fit a polynomial to a set of points from a sinusoidal signal with some noise added, using linear_model.LinearRegression from sklearn.
As expected, the training and validation scores increase as the degree of the polynomial increases, but after some degree around 20 things start getting weird: the scores start going down, and the model returns polynomials that don't look at all like the data I use to train it.
Below are some plots where this can be seen, as well as the code that generated both the regression models and the plots:
Here is how it works well until degree = 17. Original data vs. predictions:
After that it just gets worse:
Validation curve, increasing the degree of the polynomial:
from sklearn.pipeline import make_pipeline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import validation_curve
def make_data(N, err=0.1, rseed=1):
    rng = np.random.RandomState(rseed)
    x = 10 * rng.rand(N)
    X = x[:, None]
    y = np.sin(x) + 0.1 * rng.randn(N)
    if err > 0:
        y += err * rng.randn(N)
    return X, y

def PolynomialRegression(degree=4):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression())
X, y = make_data(400)
X_test = np.linspace(0, 10, 500)[:, None]
degrees = np.arange(0, 40)
plt.figure(figsize=(16, 8))
plt.scatter(X.flatten(), y)
for degree in degrees:
    y_test = PolynomialRegression(degree).fit(X, y).predict(X_test)
    plt.plot(X_test, y_test, label='degree={0}'.format(degree))
plt.title('Original data VS predicted values for different degrees')
plt.legend(loc='best');
degree = np.arange(0, 40)
train_score, val_score = validation_curve(PolynomialRegression(), X, y,
                                          param_name='polynomialfeatures__degree',
                                          param_range=degree, cv=7)
plt.figure(figsize=(12, 6))
plt.plot(degree, np.median(train_score, 1), marker='o',
         color='blue', label='training score')
plt.plot(degree, np.median(val_score, 1), marker='o',
         color='red', label='validation score')
plt.legend(loc='best')
plt.ylim(0, 1)
plt.title('Validation curve, increasing the degree of the polynomial')
plt.xlabel('degree')
plt.ylabel('score');
I know that the validation score is expected to go down when the complexity of the model increases too much, but why does the training score go down as well? What could I be missing here?
First of all, here is how you can fix it, by setting the normalize flag to True for the model:
def PolynomialRegression(degree=4):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(normalize=True))
But why? In linear regression, the fit() function finds the best-fitting model using the Moore–Penrose inverse, which is a common way to compute the least-squares solution. When you add polynomials of the values, the augmented features become very large very quickly if you do not normalize. These large values dominate the cost computed by least squares and lead to a model that fits the larger values, i.e. the higher-order polynomial features, instead of the data.
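Note that the normalize argument has been removed from LinearRegression in recent scikit-learn releases; a roughly equivalent sketch (my own adaptation, not part of the original answer) scales the polynomial features explicitly inside the pipeline:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

def PolynomialRegression(degree=4):
    # Scale the (potentially huge) polynomial features before the linear fit.
    return make_pipeline(PolynomialFeatures(degree),
                         StandardScaler(),
                         LinearRegression())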
The plots now look better, the way they are supposed to.
The training score is expected to go down as well because the model overfits the training data. The validation error goes down thanks to the sine function's Taylor series expansion: as you increase the degree of the polynomial, your model fits the sine curve better and better.
In an ideal scenario, if you don't have a function that expands to infinitely many degrees, you see the training error going down (not monotonically, but in general) and the validation error going up after some degree (high for lower degrees -> low for some higher degree -> increasing after that).
I am running a computer simulation of a physical system of finite size, and afterwards I extrapolate to infinity (the thermodynamic limit). Theory says the data should scale linearly with system size, so I am doing linear regression.
The data I have is noisy, but for each data point I can estimate error bars. So, for example, the data points look like:
x_list = [0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587, 0.22360679774997896, 0.20412414523193154, 0.2, 0.16666666666666666]
y_list = [0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715, 0.11167239583333334, 0.10876248333333333, 0.09814170444444444, 0.08560799305555555]
y_err = [0.003306749165349316, 0.003818446389148108, 0.0056036878203831785, 0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223, 0.002981084130692832, 0.0034913019065973983]
Let's say I am trying to do this in Python.
The first way I know of is:
m, c, r_value, p_value, std_err = scipy.stats.linregress(x_list, y_list)
I understand this gives me error bars on the result, but it does not take into account the error bars of the initial data.
The second way I know of is:
m, c = numpy.polynomial.polynomial.polyfit(x_list, y_list, 1, w = [1.0 / ty for ty in y_err], full=False)
Here we use the inverse of the error bar for each point as a weight in the least-squares approximation. So if a point is not really that reliable, it will not influence the result a lot, which is reasonable.
But I cannot figure out how to get something that combines both these methods.
What I really want is what the second method does, meaning a regression where every point influences the result with a different weight. But at the same time I want to know how accurate my result is, meaning I want to know the error bars of the resulting coefficients.
How can I do this?
Not entirely sure if this is what you mean, but…using pandas, statsmodels, and patsy, we can compare an ordinary least-squares fit and a weighted least-squares fit which uses the inverse of the noise you provided as a weight matrix (statsmodels will complain about sample sizes < 20, by the way).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300
import statsmodels.formula.api as sm
x_list = [0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587, 0.22360679774997896, 0.20412414523193154, 0.2, 0.16666666666666666]
y_list = [0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715, 0.11167239583333334, 0.10876248333333333, 0.09814170444444444, 0.08560799305555555]
y_err = [0.003306749165349316, 0.003818446389148108, 0.0056036878203831785, 0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223, 0.002981084130692832, 0.0034913019065973983]
# put x and y into a pandas DataFrame, and the weights into a Series
ws = pd.DataFrame({
    'x': x_list,
    'y': y_list
})
weights = pd.Series(y_err)
wls_fit = sm.wls('x ~ y', data=ws, weights=1 / weights).fit()
ols_fit = sm.ols('x ~ y', data=ws).fit()
# show the fit summary by calling wls_fit.summary()
# wls fit r-squared is 0.754
# ols fit r-squared is 0.701
# let's plot our data
plt.clf()
fig = plt.figure()
ax = fig.add_subplot(111, facecolor='w')
ws.plot(
    kind='scatter',
    x='x',
    y='y',
    style='o',
    alpha=1.,
    ax=ax,
    title='x vs y scatter',
    edgecolor='#ff8300',
    s=40
)
# weighted prediction
wp, = ax.plot(
    wls_fit.predict(),
    ws['y'],
    color='#e55ea2',
    lw=1.,
    alpha=1.0,
)
# unweighted prediction
op, = ax.plot(
    ols_fit.predict(),
    ws['y'],
    color='k',
    ls='solid',
    lw=1,
    alpha=1.0,
)
leg = plt.legend(
    (op, wp),
    ('Ordinary Least Squares', 'Weighted Least Squares'),
    loc='upper left',
    fontsize=8)
plt.tight_layout()
fig.set_size_inches(6.40, 5.12)
plt.show()
WLS residuals:
[0.025624005084707302,
0.013611438189866154,
-0.033569595462217161,
0.044110895217014695,
-0.025071632845910546,
-0.036308252199571928,
-0.010335514810672464,
-0.0081511479431851663]
The mean squared error of the residuals for the weighted fit (wls_fit.mse_resid or wls_fit.scale) is 0.22964802498892287, and the r-squared value of the fit is 0.754.
You can obtain a wealth of data about the fits by calling their summary() method, and/or doing dir(wls_fit), if you need a list of every available property and method.
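Since the question specifically asks for error bars on the fitted coefficients, here is a short sketch using the wls_fit result from above (params and bse are standard attributes of statsmodels regression results):
# Coefficient estimates and their standard errors (the error bars).
print(wls_fit.params)  # intercept and slope
print(wls_fit.bse)     # standard errors of those coefficients
# The full regression table also lists them, together with confidence intervals.
print(wls_fit.summary())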
I wrote a concise function that performs the weighted linear regression of a data set; it is a direct translation of GSL's gsl_fit_wlinear function. This is useful if you want to know exactly what your function is doing when it performs the fit.
import numpy as np

def wlinear_fit(x, y, w):
    """
    Fit (x, y, w) to a linear function, using exact formulae for weighted linear
    regression. This code was translated from the GNU Scientific Library (GSL);
    it is an exact copy of the function gsl_fit_wlinear.
    """
    # compute the weighted means and weighted deviations from the means
    # wm denotes a "weighted mean", wm(f) = (sum_i w_i f_i) / (sum_i w_i)
    W = np.sum(w)
    wm_x = np.average(x, weights=w)
    wm_y = np.average(y, weights=w)
    dx = x - wm_x
    dy = y - wm_y
    wm_dx2 = np.average(dx**2, weights=w)
    wm_dxdy = np.average(dx*dy, weights=w)
    # In terms of y = a + b x
    b = wm_dxdy / wm_dx2
    a = wm_y - wm_x*b
    cov_00 = (1.0/W) * (1.0 + wm_x**2/wm_dx2)
    cov_11 = 1.0 / (W*wm_dx2)
    cov_01 = -wm_x / (W*wm_dx2)
    # Compute chi^2 = sum_i w_i (y_i - (a + b*x_i))**2
    chi2 = np.sum(w * (y - (a + b*x))**2)
    return a, b, cov_00, cov_11, cov_01, chi2
To perform your fit, you would do:
a, b, cov_00, cov_11, cov_01, chi2 = wlinear_fit(np.array(x_list), np.array(y_list), 1.0/np.array(y_err)**2)
This will return the best estimates for the coefficients a (the intercept) and b (the slope) of the linear regression, along with the elements of the covariance matrix cov_00, cov_01 and cov_11. The best estimate of the error on a is then the square root of cov_00, and the one on b is the square root of cov_11. The weighted sum of the residuals is returned in the chi2 variable.
IMPORTANT: this function accepts inverse variances, not inverse standard deviations, as the weights for the data points.
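Putting the pieces together, a short usage sketch that extracts the error bars described above:
import numpy as np

x = np.array(x_list)
y = np.array(y_list)
w = 1.0 / np.array(y_err)**2   # weights are inverse variances

a, b, cov_00, cov_11, cov_01, chi2 = wlinear_fit(x, y, w)
a_err = np.sqrt(cov_00)  # error bar on the intercept a
b_err = np.sqrt(cov_11)  # error bar on the slope b
print(f'a = {a} +/- {a_err}')
print(f'b = {b} +/- {b_err}')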
sklearn.linear_model.LinearRegression supports specification of weights during fit:
import numpy as np
from sklearn.linear_model import LinearRegression

x_data = np.array(x_list).reshape(-1, 1)  # The model expects shape (n_samples, n_features).
y_data = np.array(y_list)
y_err = np.array(y_err)

model = LinearRegression()
model.fit(x_data, y_data, sample_weight=1/y_err)
Here the sample weight is specified as 1 / y_err. Different variants are possible, and it is often a good idea to clip these sample weights to a maximum value, in case y_err varies strongly or has small outliers:
sample_weight = 1 / y_err
sample_weight = np.minimum(sample_weight, MAX_WEIGHT)
where MAX_WEIGHT should be determined from your data (by looking at the y_err or 1 / y_err distributions, e.g. if they have outliers they can be clipped).
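One hypothetical way to choose MAX_WEIGHT (my own illustration, not from the answer) is to cap the weights at a high percentile of their distribution:
import numpy as np

sample_weight = 1 / y_err
# Cap at the 95th percentile of the weights; the percentile itself is an arbitrary choice.
MAX_WEIGHT = np.percentile(sample_weight, 95)
sample_weight = np.minimum(sample_weight, MAX_WEIGHT)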
I found this document helpful in understanding and setting up my own weighted least squares routine (applicable for any programming language).
Typically learning and using optimized routines is the best way to go but there are times where understanding the guts of a routine is important.