Kernel ridge and simple Ridge with Polynomial features - python

What is the difference between Kernel Ridge (from sklearn.kernel_ridge) with polynomial kernel and using PolynomialFeatures + Ridge (from sklearn.linear_model)?

The difference is in feature computation. PolynomialFeatures explicitly computes polynomial combinations between the input features up to the desired degree while KernelRidge(kernel='poly') only considers a polynomial kernel (a polynomial representation of feature dot products) which will be expressed in terms of the original features. This document provides a good overview in general.
Regarding the computation we can inspect the relevant parts from the source code:
Ridge Regression
The actual computation starts here (for the default settings); you can compare with equation (5) in the above linked document. The computation involves computing the dot product between feature vectors (the kernel), then the dual coefficients (alpha) and finally a dot product with the feature vectors in order to obtain the weights.
Kernel Ridge
Similarly computes the dual coefficients and stores them (instead of computing some weights). This is because when making predictions, again the kernel between training and prediction samples is computed. The result is then dotted with the dual coefficients.
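As a minimal sketch of the prediction step just described (synthetic data and `alpha=1.0` are assumptions made for illustration; the variable names are not taken from the scikit-learn source), the stored dual coefficients can be combined by hand with the kernel between new and training samples:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import polynomial_kernel

# Synthetic single-feature data (assumed for illustration only).
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 2, size=(40, 1))
y_train = x_train[:, 0]**2 + 4 * x_train[:, 0] + rng.normal(scale=0.2, size=40)
x_new = np.linspace(0, 2, 5).reshape(-1, 1)

alpha = 1.0
krr = KernelRidge(alpha=alpha, kernel='poly', degree=2, gamma=1, coef0=1)
krr.fit(x_train, y_train)

# Dual coefficients: solve (K + alpha * I) c = y on the training kernel.
K_train = polynomial_kernel(x_train, degree=2, gamma=1, coef0=1)
dual_manual = np.linalg.solve(K_train + alpha * np.eye(len(x_train)), y_train)

# Prediction: kernel between new and training samples, dotted with the duals.
K_new = polynomial_kernel(x_new, x_train, degree=2, gamma=1, coef0=1)
pred_manual = K_new @ krr.dual_coef_

print(np.allclose(dual_manual, krr.dual_coef_))
print(np.allclose(pred_manual, krr.predict(x_new)))
```

The training samples have to be kept around for prediction, which is why KernelRidge stores them together with the dual coefficients.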
The computation of the (training) kernel follows a similar procedure: compare Ridge and KernelRidge. The major difference is that Ridge explicitly takes the dot product between whatever (polynomial) features it has received, while for KernelRidge these polynomial features are generated implicitly during the computation. For example, consider a single feature x; with gamma = coef0 = 1, KernelRidge computes, for a sample with itself, (x**2 + 1)**2 == (x**4 + 2*x**2 + 1). PolynomialFeatures instead provides the features x**2, x, 1, and the corresponding dot product is x**4 + x**2 + 1. Hence the two dot products differ by a term x**2. Of course we could rescale the poly-features to x**2, sqrt(2)*x, 1 and recover the kernel exactly, while with KernelRidge(kernel='poly') we don't have this kind of flexibility. On the other hand, the difference probably doesn't matter in most cases.
Note that also the computation of the dual coefficients is performed in a similar manner: Ridge and KernelRidge. Finally KernelRidge keeps the dual coefficients while Ridge directly computes the weights.
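The single-feature identity can be checked numerically. The following is only a sketch on an arbitrary grid of x values (an assumption); it compares the implicit polynomial kernel with explicit dot products of the plain and the rescaled feature vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

x = np.linspace(-2, 2, 7).reshape(-1, 1)

# Implicit features: k(x, x') = (gamma * x * x' + coef0) ** degree
K_implicit = polynomial_kernel(x, degree=2, gamma=1, coef0=1)

# Explicit features, with and without the sqrt(2) rescaling of the x column.
phi_scaled = np.hstack([x**2, np.sqrt(2) * x, np.ones_like(x)])
phi_plain = np.hstack([x**2, x, np.ones_like(x)])
K_scaled = phi_scaled @ phi_scaled.T
K_plain = phi_plain @ phi_plain.T

print(np.allclose(K_implicit, K_scaled))              # rescaled features match the kernel
print(np.allclose(K_implicit - K_plain, x @ x.T))     # plain features miss one x * x' term
```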
Let's see a small example:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.utils.extmath import safe_sparse_dot
a, b = 1, 4
x = np.linspace(0, 2, 100).reshape(-1, 1)
y = a*x**2 + b*x + np.random.normal(scale=0.2, size=(100,1))
poly = PolynomialFeatures(degree=2, include_bias=True)
xp = poly.fit_transform(x)
print('We can see that the new features are now [1, x, x**2]:')
print(f'xp.shape: {xp.shape}')
print(f'xp[-5:]:\n{xp[-5:]}', end='\n\n')
# Scale the `x` columns so we obtain similar results.
xp[:, 1] *= np.sqrt(2)
ridge = Ridge(alpha=0, fit_intercept=False, solver='cholesky').fit(xp, y)
krr = KernelRidge(alpha=0, kernel='poly', degree=2, gamma=1, coef0=1).fit(x, y)
# Let's try to reproduce some of the involved steps for the different models.
ridge_K = safe_sparse_dot(xp, xp.T)
krr_K = krr._get_kernel(x)
print('The computed kernels are (almost) similar:')
print(f'Max. kernel difference: {np.abs(ridge_K - krr_K).max()}', end='\n\n')
print('Predictions slightly differ though:')
print(f'Max. difference: {np.abs(krr.predict(x) - ridge.predict(xp)).max()}', end='\n\n')
# Let's see if the fit changes if we provide `x**2, x, 1` instead of `x**2, sqrt(2)*x, 1`.
xp_2 = xp.copy()
xp_2[:, 1] /= np.sqrt(2)
ridge_2 = Ridge(alpha=0, fit_intercept=False, solver='cholesky').fit(xp_2, y)
print('Using features "[x**2, x, 1]" instead of "[x**2, sqrt(2)*x, 1]" predictions are (almost) the same:')
print(f'Max. difference: {np.abs(ridge_2.predict(xp_2) - ridge.predict(xp)).max()}', end='\n\n')
print('Interpretability of the coefficients changes though:')
print(f'ridge.coef_[1:]: {ridge.coef_[0, 1:]}, ridge_2.coef_[1:]: {ridge_2.coef_[0, 1:]}')
print(f'ridge.coef_[1]*sqrt(2): {ridge.coef_[0, 1]*np.sqrt(2)}')
print(f'Compare with: a, b = ({a}, {b})')
plt.plot(x.ravel(), y.ravel(), 'o', color='skyblue', label='Data')
plt.plot(x.ravel(), ridge.predict(xp).ravel(), '-', label='Ridge', lw=3)
plt.plot(x.ravel(), krr.predict(x).ravel(), '--', label='KRR', lw=3)
plt.legend()
plt.show()
From which we obtain:
We can see that the new features are now [1, x, x**2]:
xp.shape: (100, 3)
xp[-5:]:
[[1. 1.91919192 3.68329762]
[1. 1.93939394 3.76124885]
[1. 1.95959596 3.84001632]
[1. 1.97979798 3.91960004]
[1. 2. 4. ]]
The computed kernels are (almost) similar:
Max. kernel difference: 1.0658141036401503e-14
Predictions slightly differ though:
Max. difference: 0.04244651134471766
Using features "[x**2, x, 1]" instead of "[x**2, sqrt(2)*x, 1]" predictions are (almost) the same:
Max. difference: 7.15642822779472e-14
Interpretability of the coefficients changes though:
ridge.coef_[1:]: [2.73232239 1.08868872], ridge_2.coef_[1:]: [3.86408737 1.08868872]
ridge.coef_[1]*sqrt(2): 3.86408737392841
Compare with: a, b = (1, 4)

This is an example to show it:
import matplotlib.pyplot as plt
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
plt.title('Complex regression problem with one input variable')
X_F1, y_F1 = make_friedman1(n_samples = 100,
n_features = 7, random_state=0)
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
print('\nNow we transform the original input data to add\n\
polynomial features up to degree 2 (quadratic)\n')
poly = PolynomialFeatures(degree=2)
X_F1_poly = poly.fit_transform(X_F1)
X_train, X_test, y_train, y_test = train_test_split(X_F1_poly, y_F1,
random_state = 0)
# Ridge regression on the explicit degree-2 polynomial features
linreg = Ridge().fit(X_train, y_train)
print('(poly deg 2 + ridge) linear model coeff (w):\n{}'
.format(linreg.coef_))
print('(poly deg 2 + ridge) linear model intercept (b): {:.3f}'
.format(linreg.intercept_))
print('(poly deg 2 + ridge) R-squared score (training): {:.3f}'
.format(linreg.score(X_train, y_train)))
print('(poly deg 2 + ridge) R-squared score (test): {:.3f}'
.format(linreg.score(X_test, y_test)))
(poly deg 2 + ridge) linear model coeff (w):
[ 0. 2.23 4.73 -3.15 3.86 1.61 -0.77 -0.15 -1.75 1.6 1.37 2.52
2.72 0.49 -1.94 -1.63 1.51 0.89 0.26 2.05 -1.93 3.62 -0.72 0.63
-3.16 1.29 3.55 1.73 0.94 -0.51 1.7 -1.98 1.81 -0.22 2.88 -0.89]
(poly deg 2 + ridge) linear model intercept (b): 5.418
(poly deg 2 + ridge) R-squared score (training): 0.826
(poly deg 2 + ridge) R-squared score (test): 0.825

I assume you already know how kernel ridge regression (KRR) and PolynomialFeatures + Ridge work. They are largely the same; I will list some minor differences between them.
You can switch off the bias feature in PolynomialFeatures and let Ridge fit the intercept instead. The regularization term of Ridge doesn't include the intercept. In contrast, for sklearn's KRR the penalty term always includes the bias term.
You can scale the features generated by PolynomialFeatures before you pass them to Ridge. This is equivalent to customizing the regularization strength for each polynomial feature, so PolynomialFeatures + Ridge is a little more flexible. In contrast, the polynomial kernel has only two parameters to tune, i.e. gamma and c_0 (coef0); see polynomial kernel.
The fit and prediction times differ. In KRR you need to solve the system of linear equations K_NxN x = y. With PolynomialFeatures + Ridge you only need to solve the system A_Nx(D+1) x = y, where N is the number of training samples and D is the degree of the polynomial.
(This is very much a corner case.) The kernel matrix will be (almost) singular if two samples are (nearly) identical. When alpha (the regularization strength) is also very small, you will run into numerical stability problems, since K + alpha*I is almost singular. You can only overcome this problem by using Ridge. The reason why Ridge works is explained in many machine learning textbooks.
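To make the remarks on system size and conditioning concrete, here is a small sketch on synthetic single-feature data (the data and variable names are assumptions). It also illustrates a related fact: once N exceeds the number of polynomial features, the N x N kernel matrix is rank-deficient even without duplicated samples, so the dual system relies on alpha > 0, while the small primal system built from the explicit features stays full rank.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=(50, 1))

# Explicit degree-2 features: columns 1, x, x**2
phi = PolynomialFeatures(degree=2).fit_transform(x)

K = phi @ phi.T   # 50 x 50 matrix of the dual (kernel) formulation
A = phi.T @ phi   # 3 x 3 matrix of the primal formulation

rank_K = np.linalg.matrix_rank(K)
rank_A = np.linalg.matrix_rank(A)
print(K.shape, rank_K)
print(A.shape, rank_A)
```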


How to get the unscaled regression coefficients errors using statsmodels?

I'm trying to compute the coefficient errors of a regression using statsmodels. Also known as the standard errors of the parameter estimates. But I need to compute their "unscaled" version. I've only managed to do so with NumPy.
You can see the meaning of "unscaled" in the docs:
cov bool or str, optional
If given and not False, return not just the estimate but also its covariance matrix.
By default, the covariance are scaled by chi2/dof, where dof = M - (deg + 1),
i.e., the weights are presumed to be unreliable except in a relative sense and
everything is scaled such that the reduced chi2 is unity. This scaling is omitted
if cov='unscaled', as is relevant for the case that the weights are w = 1/sigma, with
sigma known to be a reliable estimate of the uncertainty.
I'm using this data to run the rest of the code in this post:
import numpy as np
x = np.array([-0.841, -0.399, 0.599, 0.203, 0.527, 0.129, 0.703, 0.503])
y = np.array([1.01, 1.24, 1.09, 0.95, 1.02, 0.97, 1.01, 0.98])
sigmas = np.array([6872.26, 80.71, 47.97, 699.94, 57.55, 1561.54, 311.98, 501.08])
# The conventions for the weights are different
sm_weights = 1.0 / sigmas**2
np_weights = 1.0 / sigmas
With NumPy:
coefficients, cov = np.polyfit(x, y, deg=2, w=np_weights, cov='unscaled')
# The errors I need to get
print(np.sqrt(np.diag(cov))) # [917.57938013 191.2100413 211.29028248]
If I compute the regression using statsmodels:
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as smapi
polynomial_features = PolynomialFeatures(degree=2)
polynomial = polynomial_features.fit_transform(x.reshape(-1, 1))
model = smapi.WLS(y, polynomial, weights=sm_weights)
regression = model.fit()
# Get coefficient errors
# Notice the [::-1], statsmodels returns the coefficients in the reverse order NumPy does
print(regression.bse[::-1]) # [0.24532856, 0.05112286, 0.05649161]
So the values I get are different, but related:
np_errors = np.sqrt(np.diag(cov))
sm_errors = regression.bse[::-1]
print(np_errors / sm_errors) # [3740.2061481, 3740.2061481, 3740.2061481]
The NumPy documentation says the covariance are scaled by chi2/dof where dof = M - (deg + 1). So I tried the following:
degree = 2
model_predictions = np.polyval(coefficients, x)
residuals = (model_predictions - y)
chi_squared = np.sum(residuals**2)
degrees_of_freedom = len(x) - (degree + 1)
scale_factor = chi_squared / degrees_of_freedom
sm_cov = regression.cov_params()
unscaled_errors = np.sqrt(np.diag(sm_cov * scale_factor))[::-1] # [0.09848423, 0.02052266, 0.02267789]
unscaled_errors = np.sqrt(np.diag(sm_cov / scale_factor))[::-1] # [0.61112427, 0.12734931, 0.14072311]
What I notice is that the covariance matrix I get from NumPy is much larger than the one I get from statsmodels:
>>> cov
array([[ 841951.9188366 , -154385.61049538, -188456.18957375],
[-154385.61049538, 36561.27989418, 31208.76422516],
[-188456.18957375, 31208.76422516, 44643.58346933]])
>>> regression.cov_params()
array([[ 0.0031913 , 0.00223093, -0.0134716 ],
[ 0.00223093, 0.00261355, -0.0110361 ],
[-0.0134716 , -0.0110361 , 0.0601861 ]])
As long as I can't make them equivalent, I won't be able to get the same errors. Any idea of what the difference in scale could mean and how to make both covariance matrices equal?
statsmodels documentation is not well organized in some parts.
Here is a notebook with an example for the following
The regression models in statsmodels like OLS and WLS, have an option to keep the scale fixed. This is the equivalent to cov="unscaled" in numpy and scipy.
The statsmodels option is more general, because it allows fixing the scale at any user defined value.
If we have a model as defined in the example, either OLS or WLS, then using
regression = model.fit(cov_type="fixed scale")
will keep the scale at 1 and the resulting covariance matrix is unscaled.
regression = model.fit(cov_type="fixed scale", cov_kwds={"scale": 2})
will keep the scale fixed at value two.
(some links to related discussion motivation are in )
The fixed scale cov_type will be used for inferential statistics that are based on the covariance of the parameter estimates, cov_params.
This affects standard errors, t-tests, wald tests and confidence and prediction intervals.
However, some other results statistics might not be adjusted to use the fixed scale instead of the estimated scale, e.g. resid_pearson.

Python - Multiple Linear Regression - Coefficient of Determination for each Input Variable

I am performing a fairly straightforward multiple linear regression in Python using sklearn. See code snippet below - full_results is a dataframe in which all variables are numeric.
The result of this code is a single coefficient of determination, which I believe denotes how much of the change in y is due to the combination of x1 - x4.
My question is whether the coefficient of determination can be split out between the 4 input variables, so I can see how much change in y is attributed to each variable individually.
I can of course run a single variable linear regression for each variable independently, but this doesn't feel like the right solution.
I have a memory of being in stats class many years ago and doing something similar in R.
from sklearn.linear_model import LinearRegression
x = full_results[['x1','x2','x3','x4']].values
y = full_results['y'].values
mlr = LinearRegression().fit(x, y)
mlr.score(x, y)
The coefficient of determination is the proportion of total variance explained. So another way of looking at it is to see the proportion of variance explained by each term, also explained here. For this we use an anova to calculate the sum of squares for each term.
One thing you have to note is that this works if your predictors are not correlated. If they are, then the order in which they are specified in the model makes a difference in the calculation.
Using an example dataset:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
import pandas as pd
X,y = make_regression(n_samples=100, n_features=4,
n_informative=3, noise=20, random_state=99)
df = pd.DataFrame(X,columns = ['x1','x2','x3','x4'])
df['y'] = y
mlr = LinearRegression().fit(df[['x1','x2','x3','x4']], y)
mlr.coef_
array([ 8.33369861, 29.1717497 , 26.6294007 , -1.82445836])
mlr.score(df[['x1','x2','x3','x4']], y)
It's easier to calculate this with statsmodels and make a linear fit, you can see the coefficients will be pretty similar:
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
lm = ols('y ~ x1 + x2 + x3 + x4',df).fit()
lm.params
Intercept -0.740399
x1 8.333699
x2 29.171750
x3 26.629401
x4 -1.824458
We get the anova:
anova_table = anova_lm(lm)
anova_table
df sum_sq mean_sq F PR(>F)
x1 1.0 10394.554366 10394.554366 28.605241 6.110239e-07
x2 1.0 113541.846572 113541.846572 312.460911 8.531356e-32
x3 1.0 66267.787822 66267.787822 182.365304 7.899193e-24
x4 1.0 298.584632 298.584632 0.821688 3.669804e-01
Residual 95.0 34521.039456 363.379363 NaN NaN
Summing everything in the sum_sq column except the residual, and dividing by the total, gives the r-squared, matching the value from sklearn:
anova_table['sum_sq'][:-1].sum() / anova_table['sum_sq'].sum()
Now the proportion of variance explained (this is not usually called r-squared) by, for example, 'x1' is:
anova_table.loc['x1','sum_sq'] / anova_table['sum_sq'].sum()
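The same sequential decomposition can be sketched with sklearn alone by fitting nested models and recording how much R-squared each added predictor contributes. The helper name `sequential_r2` is hypothetical, and the shares depend on the column order whenever the predictors are correlated.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=4,
                       n_informative=3, noise=20, random_state=99)

def sequential_r2(X, y):
    """R-squared gained by adding each column in turn (hypothetical helper)."""
    shares, previous = [], 0.0
    for k in range(1, X.shape[1] + 1):
        r2 = LinearRegression().fit(X[:, :k], y).score(X[:, :k], y)
        shares.append(r2 - previous)
        previous = r2
    return np.array(shares)

shares = sequential_r2(X, y)
full_r2 = LinearRegression().fit(X, y).score(X, y)
print(shares, shares.sum(), full_r2)
```

The increments add up to the R-squared of the full model by construction.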

Comparing computational and analytic results of linear regression

Consider simple one feature linear regression. x = features, w = weights
We have w for the best fit to the linear regression model as,
w = (xTx)^(-1)xTy
Now I am comparing the results from the scikit-learn regressor with the w computed directly from this formula, and they differ significantly.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('Salary_Data.csv')
x = data.iloc[:,[0]].values
y = data.iloc[:,[1]].values
x_t = np.transpose(x)
first_inv = np.matmul(x_t, x)
second = np.matmul(x_t, y)
first = np.linalg.inv(first_inv)
theta = np.matmul(first, second)
y_prad = theta*x
from sklearn.linear_model import LinearRegression
regressor = LinearRegression().fit(x, y)
y_prad2 = regressor.predict(x)
plt.scatter(x, y)
plt.plot(x, y_prad , 'red')
plt.plot(x, y_prad2, 'green')
Where am I wrong here?(whatever in concepts or code)
You are forgetting the intercept term. Add a column of ones to the x matrix using
x = np.insert(x, 0, 1, axis=1)
and then re-run the calculations. The shape of x should be (30, 2), where the first column is all 1's to represent the constant multiplied by the intercept. The final shape of theta should be (2, 1), where the first term is the intercept and the second is the slope. The prediction then becomes np.matmul(x, theta) rather than theta*x.
Here is a good reference for matrix formulation of linear regression.
Matrix Formulation of Linear Regression
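A minimal sketch of the corrected calculation follows. Because Salary_Data.csv is not available here, it uses synthetic data of the same shape (an assumption), and it solves the normal equations with np.linalg.solve instead of forming the inverse explicitly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the salary data: 30 samples, one feature.
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=(30, 1))
y = 25000 + 9500 * x + rng.normal(scale=3000, size=(30, 1))

# Prepend a column of ones so that the first weight is the intercept.
X = np.insert(x, 0, 1, axis=1)                 # shape (30, 2)
theta = np.linalg.solve(X.T @ X, X.T @ y)      # shape (2, 1): intercept, slope
y_pred = X @ theta

regressor = LinearRegression().fit(x, y)
print(theta.ravel())
print(regressor.intercept_, regressor.coef_.ravel())
```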

what does the option normalize = True in Lasso sklearn do?

I have a matrix where each column has mean 0 and std 1
In [67]: x_val.std(axis=0).min()
Out[70]: 0.99999999999999922
In [71]: x_val.std(axis=0).max()
Out[71]: 1.0000000000000007
In [72]: x_val.mean(axis=0).max()
Out[72]: 1.1990408665951691e-16
In [73]: x_val.mean(axis=0).min()
Out[73]: -9.7144514654701197e-17
The number of non 0 coefficients changes if I use the normalize option
In [74]: l = Lasso(alpha=alpha_perc70).fit(x_val, y_val)
In [81]: sum(l.coef_!=0)
Out[83]: 47
In [84]: l2 = Lasso(alpha=alpha_perc70, normalize=True).fit(x_val, y_val)
In [93]: sum(l2.coef_!=0)
Out[95]: 3
It seems to me that normalize just sets the variance of each column to 1. It is strange that the results change so much, since my data already has variance 1.
So what does normalize=True actually do?
This is due to an (or a potential [1]) inconsistency in the concept of scaling in sklearn.linear_model.base.center_data: if normalize=True, it divides each column of the design matrix by its norm, not by its standard deviation. For what it's worth, the normalize keyword was later deprecated (scikit-learn 1.0) and removed (scikit-learn 1.2).
Solution: Do not use normalize=True. Instead, build a sklearn.pipeline.Pipeline and prepend a sklearn.preprocessing.StandardScaler to your Lasso object. That way you don't even need to perform your initial scaling.
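A minimal sketch of such a pipeline, on synthetic data with deliberately different column scales (the data and the alpha value are assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
n_samples, n_features = 20, 10
# Columns with very different scales, to make the scaling step matter.
X = rng.randn(n_samples, n_features) * rng.uniform(1, 50, size=n_features)
beta = rng.randn(n_features)
y = X.dot(beta) + rng.randn(n_samples)

# StandardScaler divides by the standard deviation, not by the column norm.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)

scaled = model.named_steps['standardscaler'].transform(X)
coef = model.named_steps['lasso'].coef_
print(scaled.std(0))
print(coef)
```

The coefficients refer to the standardized columns; the scaler learned on the training data is applied automatically when calling model.predict.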
Note that the data loss term in the sklearn implementation of Lasso is scaled by n_samples. Thus the minimal penalty yielding a zero solution is alpha_max = np.abs(X.T.dot(y)).max() / n_samples (for normalize=False).
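The role of this threshold can be checked with a short sketch on synthetic, centered data (the data and the factors 1.001 and 0.9 are assumptions): a penalty just above the threshold zeroes every coefficient, while a penalty below it does not.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n_samples, n_features = 50, 8
X = rng.randn(n_samples, n_features)
X = (X - X.mean(0)) / X.std(0)
y = X.dot(rng.randn(n_features)) + rng.randn(n_samples)
y = y - y.mean()

# Smallest penalty for which the all-zero solution is optimal.
alpha_max = np.abs(X.T.dot(y)).max() / n_samples

coef_above = Lasso(alpha=alpha_max * 1.001).fit(X, y).coef_
coef_below = Lasso(alpha=alpha_max * 0.9).fit(X, y).coef_
print(alpha_max)
print(np.count_nonzero(coef_above), np.count_nonzero(coef_below))
```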
[1] I say potential inconsistency, because normalize is associated to the word norm and thus at least linguistically consistent :)
[Stop reading here if you don't want the details]
Here is some copy and pasteable code reproducing the problem
import numpy as np
rng = np.random.RandomState(42)
n_samples, n_features, n_active_vars = 20, 10, 5
X = rng.randn(n_samples, n_features)
X = ((X - X.mean(0)) / X.std(0))
beta = rng.randn(n_features)
beta[rng.permutation(n_features)[:n_active_vars]] = 0.
y = X.dot(beta)
print(X.std(0))
print(X.mean(0))
from sklearn.linear_model import Lasso
lasso1 = Lasso(alpha=.1)
print(lasso1.fit(X, y).coef_)
lasso2 = Lasso(alpha=.1, normalize=True)
print(lasso2.fit(X, y).coef_)
In order to understand what is going on, now observe that lasso1.fit(X / np.sqrt(n_samples), y).coef_ / np.sqrt(n_samples)
is equal to lasso2.fit(X, y).coef_
Hence, scaling the design matrix and appropriately rescaling the coefficients by np.sqrt(n_samples) converts one model to the other. This can also be achieved by acting on the penalty: A lasso estimator with normalize=True with its penalty scaled down by np.sqrt(n_samples) acts like a lasso estimator with normalize=False (on your type of data, i.e. already standardized to std=1).
lasso3 = Lasso(alpha=.1 / np.sqrt(n_samples), normalize=True)
print(lasso3.fit(X, y).coef_)  # yields the same coefficients as lasso1.fit(X, y).coef_
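On scikit-learn releases that no longer accept the normalize keyword, the same relationship can be sketched by emulating it by hand: divide each column by its L2 norm before fitting and divide the coefficients by the same norms afterwards. That this reproduces the old normalize=True behaviour for already centered data is an assumption of this sketch; the tight tol and max_iter settings are only there to make the comparison sharp.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(42)
n_samples, n_features, n_active_vars = 20, 10, 5
X = rng.randn(n_samples, n_features)
X = (X - X.mean(0)) / X.std(0)
beta = rng.randn(n_features)
beta[rng.permutation(n_features)[:n_active_vars]] = 0.
y = X.dot(beta)

# For standardized columns the L2 norm is sqrt(n_samples).
col_norms = np.linalg.norm(X, axis=0)
opts = dict(tol=1e-12, max_iter=100000)

plain = Lasso(alpha=0.1, **opts).fit(X, y).coef_
# Emulated normalize=True with the penalty scaled down by sqrt(n_samples).
emulated = Lasso(alpha=0.1 / np.sqrt(n_samples), **opts).fit(X / col_norms, y).coef_ / col_norms

print(plain)
print(emulated)
```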
I think the top answer is wrong...
In Lasso, if you set normalize=True, every column will be divided by its L2 norm (i.e., sd*sqrt(n)) before fitting the lasso regression. The magnitude of the design matrix is thus reduced, and the "expected" coefficients are enlarged. The larger the coefficients, the stronger the L1 penalty. So the fit has to pay more attention to the L1 penalty and sets more coefficients to 0. You will see sparser coefficients (β=0) as a result.

How to do linear regression, taking errorbars into account?

I am doing a computer simulation for some physical system of finite size, and after this I am extrapolating to infinity (the thermodynamic limit). Some theory says that the data should scale linearly with system size, so I am doing linear regression.
The data I have is noisy, but for each data point I can estimate errorbars. So, for example, the data points look like:
x_list = [0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587, 0.22360679774997896, 0.20412414523193154, 0.2, 0.16666666666666666]
y_list = [0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715, 0.11167239583333334, 0.10876248333333333, 0.09814170444444444, 0.08560799305555555]
y_err = [0.003306749165349316, 0.003818446389148108, 0.0056036878203831785, 0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223, 0.002981084130692832, 0.0034913019065973983]
Let's say I am trying to do this in Python.
First way that I know is:
m, c, r_value, p_value, std_err = scipy.stats.linregress(x_list, y_list)
I understand this gives me errorbars of the result, but this does not take into account errorbars of the initial data.
Second way that I know is:
m, c = numpy.polynomial.polynomial.polyfit(x_list, y_list, 1, w = [1.0 / ty for ty in y_err], full=False)
Here we use the inverse of the errorbar for each point as a weight in the least-squares approximation. So if a point is not really that reliable, it will not influence the result a lot, which is reasonable.
But I can not figure out how to get something that combines both these methods.
What I really want is what second method does, meaning use regression when every point influences the result with different weight. But at the same time I want to know how accurate my result is, meaning, I want to know what are errorbars of the resulting coefficients.
How can I do this?
Not entirely sure if this is what you mean, but…using pandas, statsmodels, and patsy, we can compare an ordinary least-squares fit and a weighted least-squares fit which uses the inverse of the noise you provided as a weight matrix (statsmodels will complain about sample sizes < 20, by the way).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300
import statsmodels.formula.api as sm
x_list = [0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587, 0.22360679774997896, 0.20412414523193154, 0.2, 0.16666666666666666]
y_list = [0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715, 0.11167239583333334, 0.10876248333333333, 0.09814170444444444, 0.08560799305555555]
y_err = [0.003306749165349316, 0.003818446389148108, 0.0056036878203831785, 0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223, 0.002981084130692832, 0.0034913019065973983]
# put x and y into a pandas DataFrame, and the weights into a Series
ws = pd.DataFrame({
'x': x_list,
'y': y_list
})
weights = pd.Series(y_err)
wls_fit = sm.wls('x ~ y', data=ws, weights=1 / weights).fit()
ols_fit = sm.ols('x ~ y', data=ws).fit()
# show the fit summary by calling wls_fit.summary()
# wls fit r-squared is 0.754
# ols fit r-squared is 0.701
# let's plot our data
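fig = plt.figure()
ax = fig.add_subplot(111, facecolor='w')
ws.plot(kind='scatter', x='x', y='y', ax=ax,
title='x vs y scatter',
)
# weighted prediction (the formula 'x ~ y' predicts x from y)
wp, = ax.plot(wls_fit.predict(), ws['y'])
# unweighted prediction
op, = ax.plot(ols_fit.predict(), ws['y'])
leg = plt.legend(
(op, wp),
('Ordinary Least Squares', 'Weighted Least Squares'),
loc='upper left',
)
fig.set_size_inches(6.40, 5.12)
plt.show()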
WLS residuals:
The mean squared error of the residuals for the weighted fit (wls_fit.mse_resid or wls_fit.scale) is 0.22964802498892287, and the r-squared value of the fit is 0.754.
You can obtain a wealth of data about the fits by calling their summary() method, and/or doing dir(wls_fit), if you need a list of every available property and method.
I wrote a concise function to perform the weighted linear regression of a data set, which is a direct translation of GSL's "gsl_fit_wlinear" function. This is useful if you want to know exactly what your function is doing when it performs the fit
def wlinear_fit(x, y, w):
    """
    Fit (x,y,w) to a linear function, using exact formulae for weighted linear
    regression. This code was translated from the GNU Scientific Library (GSL),
    it is an exact copy of the function gsl_fit_wlinear.
    """
    # compute the weighted means and weighted deviations from the means
    # wm denotes a "weighted mean", wm(f) = (sum_i w_i f_i) / (sum_i w_i)
    W = np.sum(w)
    wm_x = np.average(x, weights=w)
    wm_y = np.average(y, weights=w)
    dx = x - wm_x
    dy = y - wm_y
    wm_dx2 = np.average(dx**2, weights=w)
    wm_dxdy = np.average(dx*dy, weights=w)
    # In terms of y = a + b x
    b = wm_dxdy / wm_dx2
    a = wm_y - wm_x*b
    cov_00 = (1.0/W) * (1.0 + wm_x**2/wm_dx2)
    cov_11 = 1.0 / (W*wm_dx2)
    cov_01 = -wm_x / (W*wm_dx2)
    # Compute chi^2 = \sum w_i (y_i - (a + b * x_i))^2
    chi2 = np.sum(w * (y - (a + b*x))**2)
    return a, b, cov_00, cov_11, cov_01, chi2
To perform your fit, you would do
a,b,cov_00,cov_11,cov_01,chi2 = wlinear_fit(np.array(x_list), np.array(y_list), 1.0/np.array(y_err)**2)
Which will return the best estimate for the coefficients a (the intercept) and b (the slope) of the linear regression, along with the elements of the covariance matrix cov_00, cov_01 and cov_11. The best estimate on the error on a is then the square root of cov_00 and the one on b is the square root of cov_11. The weighted sum of the residuals is returned in the chi2 variable.
IMPORTANT: this function accepts inverse variances, not the inverse standard deviations as the weights for the data points.
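As a cross-check that does not depend on a hand-written routine, the same quantities can be sketched with np.polyfit, which expects 1/sigma weights and returns the unscaled covariance when cov='unscaled' is passed. The closed-form comparison at the end is included only to show what that covariance is; treating y_err as reliable absolute uncertainties is an assumption.

```python
import numpy as np

x = np.array([0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587,
              0.22360679774997896, 0.20412414523193154, 0.2, 0.16666666666666666])
y = np.array([0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715,
              0.11167239583333334, 0.10876248333333333, 0.09814170444444444, 0.08560799305555555])
y_err = np.array([0.003306749165349316, 0.003818446389148108, 0.0056036878203831785,
                  0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223,
                  0.002981084130692832, 0.0034913019065973983])

# polyfit takes 1/sigma weights; coefficients come back highest degree first.
(slope, intercept), cov = np.polyfit(x, y, 1, w=1.0 / y_err, cov='unscaled')
slope_err, intercept_err = np.sqrt(np.diag(cov))

# Closed form: inverse of X^T W X with inverse-variance weights W.
X = np.column_stack([x, np.ones_like(x)])
W = np.diag(1.0 / y_err**2)
cov_closed = np.linalg.inv(X.T @ W @ X)

print(slope, slope_err)
print(intercept, intercept_err)
```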
sklearn.linear_model.LinearRegression supports specification of weights during fit:
from sklearn.linear_model import LinearRegression
x_data = np.array(x_list).reshape(-1, 1) # The model expects shape (n_samples, n_features).
y_data = np.array(y_list)
y_err = np.array(y_err)
model = LinearRegression().fit(x_data, y_data, sample_weight=1/y_err)
Here the sample weight is specified as 1 / y_err. Different versions are possible and often it's a good idea to clip these sample weights to a maximum value in case the y_err varies strongly or has small outliers:
sample_weight = 1 / y_err
sample_weight = np.minimum(sample_weight, MAX_WEIGHT)
where MAX_WEIGHT should be determined from your data (by looking at the y_err or 1 / y_err distributions, e.g. if they have outliers they can be clipped).
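A sketch of the clipping step on the data from the question; picking the 75th percentile of the weights as MAX_WEIGHT is purely a hypothetical choice for illustration and should be replaced by a value justified by your own data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x_data = np.array([0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587,
                   0.22360679774997896, 0.20412414523193154, 0.2,
                   0.16666666666666666]).reshape(-1, 1)
y_data = np.array([0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715,
                   0.11167239583333334, 0.10876248333333333, 0.09814170444444444,
                   0.08560799305555555])
y_err = np.array([0.003306749165349316, 0.003818446389148108, 0.0056036878203831785,
                  0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223,
                  0.002981084130692832, 0.0034913019065973983])

sample_weight = 1 / y_err
# Hypothetical cap: the 75th percentile of the raw weights.
MAX_WEIGHT = np.percentile(sample_weight, 75)
clipped_weight = np.minimum(sample_weight, MAX_WEIGHT)

model = LinearRegression().fit(x_data, y_data, sample_weight=clipped_weight)
print(model.coef_, model.intercept_)
```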
I found this document helpful in understanding and setting up my own weighted least squares routine (applicable for any programming language).
Typically learning and using optimized routines is the best way to go but there are times where understanding the guts of a routine is important.

