GPflow classification: interpretation of posterior variance

GPflow classification: interpretation of posterior variance - python

In the tutorial on multiclass classification on the GPflow website, a Sparse Variational Gaussian Process (SVGP) is used on a 1D toy example. As is the case for all other GPflow models, the SVGP model has a method predict_y(self, Xnew) which returns the mean and variance of held-out data at the points Xnew.
From the tutorial it is clear that the first argument that is unpacked from predict_y is the posterior predictive probability of each of the three classes (cells [7] and [8]), shown as the colored lines in the second panel of the plot below. However, the authors do not elaborate on the second argument that can be unpacked from predict_y, which are the variances of the predictions. In a regression setting its interpretation is clear to me, as the posterior predictive distribution in this case would be a Gaussian.
But I fail to understand what could be the interpretation here. Especially, I would like to know how this measure could be used to construct error bars denoting uncertainty around class predictions for any new data point.
I altered the code of the tutorial slightly to add an additional panel to the plot below: the third panel shows in black the maximal standard deviation (square root of the obtained variance from predict_y). It clearly is a good measure for uncertainty and it is probably also no coincidence that the highest possible value is 0.5, but I could not find how it is calculated and what it represents.
Complete notebook with all code here.
def plot(m):
f = plt.figure(figsize=(12,8))
a1 = f.add_axes([0.05, 0.05, 0.9, 0.5])
av = f.add_axes([0.05, 0.6, 0.9, 0.1])
a2 = f.add_axes([0.05, 0.75, 0.9, 0.1])
a3 = f.add_axes([0.05, 0.9, 0.9, 0.1])
xx = np.linspace(m.X.read_value().min()-0.3, m.X.read_value().max()+0.3, 200).reshape(-1,1)
mu, var = m.predict_f(xx)
mu, var = mu.copy(), var.copy()
p, v = m.predict_y(xx)
a3.set_xticks([])
a3.set_yticks([])
av.set_xticks([])
lty = ['-', '--', ':']
for i in range(m.likelihood.num_classes):
x = m.X.read_value()[m.Y.read_value().flatten()==i]
points, = a3.plot(x, x*0, '.')
color=points.get_color()
a1.fill_between(xx[:,0], mu[:,i] + 2*np.sqrt(var[:,i]), mu[:,i] - 2*np.sqrt(var[:,i]), alpha = 0.2)
a1.plot(xx, mu[:,i], color=color, lw=2)
a2.plot(xx, p[:,i], '-', color=color, lw=2)
av.plot(xx, np.sqrt(np.max(v[:,:], axis = 1)), c = "black", lw=2)
for ax in [a1, av, a2, a3]:
ax.set_xlim(xx.min(), xx.max())
a2.set_ylim(-0.1, 1.1)
a2.set_yticks([0, 1])
a2.set_xticks([])
plot(m)

Model.predict_y() calls Likelihood.predict_mean_and_var(). If you look at the documentation of the latter function [1] you see that all it does, is compute the mean and variance of the predictive distribution. I.e., we first compute the marginal predictive distribution q(y) = \int p(y|f) q(f) df, and then we compute the mean and variance of q(y).
For a Gaussian, the mean and variance can be specified independently of each other, and they have interpretations as a point prediction and the uncertainty. For a Bernoulli likelihood, the mean and variance are both completely determined by the single parameter p. The mean of the distribution is the probability of the event, which already tells us the uncertainty! The variance doesn't give much more.
However, you are right that the variance is a nice metric of uncertainty where higher means more uncertainty. The entropy as a function of p looks very similar (although the two differ in behaviour near the edges):
p = np.linspace(0.001, 1 - 0.001, 1000)[:, None]
q = 1 - p
plt.plot(p, -p * np.log(p) - q * np.log(q), label='entropy')
plt.plot(p, p * q, label='variance')
plt.legend()
plt.xlabel('probability')
[1] https://github.com/GPflow/GPflow/blob/b8ed8332549a375da8658a1117470ac86d823e7f/gpflow/likelihoods.py#L76

Related

How to get the unscaled regression coefficients errors using statsmodels?

I'm trying to compute the coefficient errors of a regression using statsmodels. Also known as the standard errors of the parameter estimates. But I need to compute their "unscaled" version. I've only managed to do so with NumPy.
You can see the meaning of "unscaled" in the docs: https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
cov bool or str, optional
If given and not False, return not just the estimate but also its covariance matrix.
By default, the covariance are scaled by chi2/dof, where dof = M - (deg + 1),
i.e., the weights are presumed to be unreliable except in a relative sense and
everything is scaled such that the reduced chi2 is unity. This scaling is omitted
if cov='unscaled', as is relevant for the case that the weights are w = 1/sigma, with
sigma known to be a reliable estimate of the uncertainty.
I'm using this data to run the rest of the code in this post:
import numpy as np
x = np.array([-0.841, -0.399, 0.599, 0.203, 0.527, 0.129, 0.703, 0.503])
y = np.array([1.01, 1.24, 1.09, 0.95, 1.02, 0.97, 1.01, 0.98])
sigmas = np.array([6872.26, 80.71, 47.97, 699.94, 57.55, 1561.54, 311.98, 501.08])
# The convention for weights are different
sm_weights = np.array([1.0/sigma**2 for sigma in sigmas])
np_weights = np.array([1.0/sigma for sigma in sigmas])
With NumPy:
coefficients, cov = np.polyfit(x, y, deg=2, w=np_weights, cov='unscaled')
# The errors I need to get
print(np.sqrt(np.diag(cov))) # [917.57938013 191.2100413 211.29028248]
If I compute the regression using statsmodels:
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as smapi
polynomial_features = PolynomialFeatures(degree=2)
polynomial = polynomial_features.fit_transform(x.reshape(-1, 1))
model = smapi.WLS(y, polynomial, weights=sm_weights)
regression = model.fit()
# Get coefficient errors
# Notice the [::-1], statsmodels returns the coefficients in the reverse order NumPy does
print(regression.bse[::-1]) # [0.24532856, 0.05112286, 0.05649161]
So the values I get are different, but related:
np_errors = np.sqrt(np.diag(cov))
sm_errors = regression.bse[::-1]
print(np_errors / sm_errors) # [3740.2061481, 3740.2061481, 3740.2061481]
The NumPy documentation says the covariance are scaled by chi2/dof where dof = M - (deg + 1). So I tried the following:
degree = 2
model_predictions = np.polyval(coefficients, x)
residuals = (model_predictions - y)
chi_squared = np.sum(residuals**2)
degrees_of_freedom = len(x) - (degree + 1)
scale_factor = chi_squared / degrees_of_freedom
sm_cov = regression.cov_params()
unscaled_errors = np.sqrt(np.diag(sm_cov * scale_factor))[::-1] # [0.09848423, 0.02052266, 0.02267789]
unscaled_errors = np.sqrt(np.diag(sm_cov / scale_factor))[::-1] # [0.61112427, 0.12734931, 0.14072311]
What I notice is that the covariance matrix I get from NumPy is much larger than the one I get from statsmodels:
>>> cov
array([[ 841951.9188366 , -154385.61049538, -188456.18957375],
[-154385.61049538, 36561.27989418, 31208.76422516],
[-188456.18957375, 31208.76422516, 44643.58346933]])
>>> regression.cov_params()
array([[ 0.0031913 , 0.00223093, -0.0134716 ],
[ 0.00223093, 0.00261355, -0.0110361 ],
[-0.0134716 , -0.0110361 , 0.0601861 ]])
As long as I can't make them equivalent, I won't be able to get the same errors. Any idea of what the difference in scale could mean and how to make both covariance matrices equal?

statsmodels documentation is not well organized in some parts.
Here is a notebook with an example for the following
https://www.statsmodels.org/devel/examples/notebooks/generated/chi2_fitting.html
The regression models in statsmodels like OLS and WLS, have an option to keep the scale fixed. This is the equivalent to cov="unscaled" in numpy and scipy.
The statsmodels option is more general, because it allows fixing the scale at any user defined value.
https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLSResults.get_robustcov_results.html
We we have a model as defined in the example, either OLS or WLS, then using
regression = model.fit(cov_type="fixed scale")
will keep the scale at 1 and the resulting covariance matrix is unscaled.
Using
regression = model.fit(cov_type="fixed scale", cov_kwds={"scale": 2})
will keep the scale fixed at value two.
(some links to related discussion motivation are in https://github.com/statsmodels/statsmodels/pull/2137 )
Caution
The fixed scale cov_type will be used for inferential statistic that are based on the covariance of the parameter estimates, cov_params.
This affects standard errors, t-tests, wald tests and confidence and prediction intervals.
However, some other results statistics might not be adjusted to use the fixed scale instead of the estimated scale, e.g. resid_pearson.
https://github.com/statsmodels/statsmodels/issues/8190

Better fitting for a power-law curve

So I had some points in a dataframe that led me to believe I was dealing with a power law curve. After some googling, I used what I found in this post to go about curve fitting.
def func_powerlaw(x, m, c, c0):
return c0 + x**m * c
target_func = func_powerlaw
X = np.array(selection_to_feed.selection[1:])
y = np.array(selection_to_feed.avg_feed_size[1:])
popt, pcov = curve_fit(func_powerlaw, X, y, p0 =np.asarray([-1,10**5,0]))
curvex = np.linspace(0,5000,1000)
curvey = target_func(curvex, *popt)
plt.figure(figsize=(10, 5))
plt.plot(curvex, curvey, '--')
plt.plot(X, y, 'ro')
plt.legend()
plt.show()
This is the result:
Curve
The problem is, the curve fit results in negative values for the first few values (as you can see in the blue line), and in the actual relationship, no negative Y values can exist.
A few questions:
What can I do make sure no negative Y values can be output? Really, an X of 0 should have a Y value of 0 as well.
Is power law curve fitting even the right thing to do? How would you describe this curve?
Thank you!

If you are only looking for a simple approximating equation with a better fit, I extracted data from your plot and added the known data point [0,0] per your note. Since the uncertainty for the [0,0] point is zero - that is, you are 100% certain of that value - I used a weighted regression where that one known point was given an extremely high weight and the weight for all other points was 1. This had the effect of forcing the curve through the [0,0] point, which can be done with any software that allows weighted fitting. I found that a Standard Geometric plus offset equation, "y = a * pow(x, (b * x)) + offset", with parameters:
a = -1.0704001788540748E+02
b = -1.5095055897637395E-03
Offset = 1.0704001788540748E+02
fits as shown in the attached plot and passes through [0,0]. My suggestion is to perform a regression using this equation with the actual data plus the known [0,0] point, using these values as the initial parameter estimates - and if possible using a very large weight for the [0,0] point as I did.

Probability density function numpy histogram/scipy stats

We have the arraya=range(10). Using numpy.histogram:
hist,bins=numpy.histogram(a,bins=(np.max(a)-np.min(a))/1, range=np.min(a),np.max(a)),density=True)
According to numpy tutorial:
If density=True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1.
The result is:
array([ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2])
I try to do the same using scipy.stats:
mean = np.mean(a)
sigma = np.std(a)
norm.pdf(a, mean, sigma)
However the result is different:
array([ 0.04070852, 0.06610774, 0.09509936, 0.12118842, 0.13680528,0.13680528, 0.12118842, 0.09509936, 0.06610774, 0.04070852])
I want to know why.
Update:I would like to set a more general question. How can we have the probability density function of an array without using numpy.histogram for density=True ?

If density=True, the result is the value of the probability density
function at the bin, normalized such that the integral over the
range is 1.
The "normalized" there does not mean that it will be transformed using a Normal Distribution. It simply says that each value in the bin will be divided by the total number of entries so that the total density would be equal to 1.

You can't compare numpy.histogram() and scipy.stats.norm() for this sample reason:
scipy.stats.norm() is A normal continuous random variable while numpy.histogram() deal with sequences (discontinuous)

Plotting a Continuous Probability Function(PDF) from a Histogram – Solved in Python. refer this blog for detailed explanation. (http://howdoudoittheeasiestway.blogspot.com/2017/09/plotting-continuous-probability.html) Else you can use the code below.
n, bins, patches = plt.hist(A, 40, histtype='bar')
plt.show()
n = n/len(A)
n = np.append(n, 0)
mu = np.mean(n)
sigma = np.std(n)
plt.bar(bins,n, width=(bins[len(bins)-1]-bins[0])/40)
y1= (1/(sigma*np.sqrt(2*np.pi))*np.exp(-(bins - mu)**2 /(2*sigma**2)))*0.03
plt.plot(bins, y1, 'r--', linewidth=2)
plt.show()

numpy.polyfit versus scipy.odr

I have a data set which in theory is described by a polynomial of the second degree. I would like to fit this data and I have used numpy.polyfit to do this. However, the down side is that the error on the returned coefficients is not available. Therefore I decided to also fit the data using scipy.odr. The weird thing was that the coefficients for the polynomial deviated from each other.
I do not understand this and therefore decided to test both fitting routines on a set of data that I produce my self:
import numpy
import scipy.odr
import matplotlib.pyplot as plt
x = numpy.arange(-20, 20, 0.1)
y = 1.8 * x**2 -2.1 * x + 0.6 + numpy.random.normal(scale = 100, size = len(x))
#Define function for scipy.odr
def fit_func(p, t):
return p[0] * t**2 + p[1] * t + p[2]
#Fit the data using numpy.polyfit
fit_np = numpy.polyfit(x, y, 2)
#Fit the data using scipy.odr
Model = scipy.odr.Model(fit_func)
Data = scipy.odr.RealData(x, y)
Odr = scipy.odr.ODR(Data, Model, [1.5, -2, 1], maxit = 10000)
output = Odr.run()
#output.pprint()
beta = output.beta
betastd = output.sd_beta
print "poly", fit_np
print "ODR", beta
plt.plot(x, y, "bo")
plt.plot(x, numpy.polyval(fit_np, x), "r--", lw = 2)
plt.plot(x, fit_func(beta, x), "g--", lw = 2)
plt.tight_layout()
plt.show()
An example of an outcome is as follows:
poly [ 1.77992643 -2.42753714 3.86331152]
ODR [ 3.8161735 -23.08952492 -146.76214989]
In the included image, the solution of numpy.polyfit (red dashed line) corresponds pretty well. The solution of scipy.odr (green dashed line) is basically completely off. I do have to note that the difference between numpy.polyfit and scipy.odr was less in the actual data set I wanted to fit. However, I do not understand where the difference between the two comes from, why in my own testing example the difference is extremely big, and which fitting routine is better?
I hope you can provide answers that might help me give a better understanding between the two fitting routines and in the process provide answers to the questions I have.

In the way you are using ODR it does a full orthogonal distance regression. To have it do a normal nonlinear least squares fit add
Odr.set_job(fit_type=2)
before starting the optimization and you will get what you expected.
The reason that the full ODR fails so badly is due to not specifying weights/standard deviations. Obviously it does hard to interpret that point cloud and assumes equal wheights for x and y. If you provide estimated standard deviations, odr will yield a good (though different of course) result, too.
Data = scipy.odr.RealData(x, y, sx=0.1, sy=10)

The actual problem is that the odr output has the beta coefficients in the opposite order than numpy.polyfit has. So the green curve is not calculated correctly. To plot it, use instead
plt.plot(x, fit_func(beta[::-1], x), "g--", lw = 2)

How to do linear regression, taking errorbars into account?

I am doing a computer simulation for some physical system of finite size, and after this I am doing extrapolation to the infinity (Thermodynamic limit). Some theory says that data should scale linearly with system size, so I am doing linear regression.
The data I have is noisy, but for each data point I can estimate errorbars. So, for example data points looks like:
x_list = [0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587, 0.22360679774997896, 0.20412414523193154, 0.2, 0.16666666666666666]
y_list = [0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715, 0.11167239583333334, 0.10876248333333333, 0.09814170444444444, 0.08560799305555555]
y_err = [0.003306749165349316, 0.003818446389148108, 0.0056036878203831785, 0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223, 0.002981084130692832, 0.0034913019065973983]
Let's say I am trying to do this in Python.
First way that I know is:
m, c, r_value, p_value, std_err = scipy.stats.linregress(x_list, y_list)
I understand this gives me errorbars of the result, but this does not take into account errorbars of the initial data.
Second way that I know is:
m, c = numpy.polynomial.polynomial.polyfit(x_list, y_list, 1, w = [1.0 / ty for ty in y_err], full=False)
Here we use the inverse of the errorbar for the each point as a weight that is used in the least square approximation. So if a point is not really that reliable it will not influence result a lot, which is reasonable.
But I can not figure out how to get something that combines both these methods.
What I really want is what second method does, meaning use regression when every point influences the result with different weight. But at the same time I want to know how accurate my result is, meaning, I want to know what are errorbars of the resulting coefficients.
How can I do this?

Not entirely sure if this is what you mean, but…using pandas, statsmodels, and patsy, we can compare an ordinary least-squares fit and a weighted least-squares fit which uses the inverse of the noise you provided as a weight matrix (statsmodels will complain about sample sizes < 20, by the way).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300
import statsmodels.formula.api as sm
x_list = [0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587, 0.22360679774997896, 0.20412414523193154, 0.2, 0.16666666666666666]
y_list = [0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715, 0.11167239583333334, 0.10876248333333333, 0.09814170444444444, 0.08560799305555555]
y_err = [0.003306749165349316, 0.003818446389148108, 0.0056036878203831785, 0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223, 0.002981084130692832, 0.0034913019065973983]
# put x and y into a pandas DataFrame, and the weights into a Series
ws = pd.DataFrame({
'x': x_list,
'y': y_list
})
weights = pd.Series(y_err)
wls_fit = sm.wls('x ~ y', data=ws, weights=1 / weights).fit()
ols_fit = sm.ols('x ~ y', data=ws).fit()
# show the fit summary by calling wls_fit.summary()
# wls fit r-squared is 0.754
# ols fit r-squared is 0.701
# let's plot our data
plt.clf()
fig = plt.figure()
ax = fig.add_subplot(111, facecolor='w')
ws.plot(
kind='scatter',
x='x',
y='y',
style='o',
alpha=1.,
ax=ax,
title='x vs y scatter',
edgecolor='#ff8300',
s=40
)
# weighted prediction
wp, = ax.plot(
wls_fit.predict(),
ws['y'],
color='#e55ea2',
lw=1.,
alpha=1.0,
)
# unweighted prediction
op, = ax.plot(
ols_fit.predict(),
ws['y'],
color='k',
ls='solid',
lw=1,
alpha=1.0,
)
leg = plt.legend(
(op, wp),
('Ordinary Least Squares', 'Weighted Least Squares'),
loc='upper left',
fontsize=8)
plt.tight_layout()
fig.set_size_inches(6.40, 5.12)
plt.show()
WLS residuals:
[0.025624005084707302,
0.013611438189866154,
-0.033569595462217161,
0.044110895217014695,
-0.025071632845910546,
-0.036308252199571928,
-0.010335514810672464,
-0.0081511479431851663]
The mean squared error of the residuals for the weighted fit (wls_fit.mse_resid or wls_fit.scale) is 0.22964802498892287, and the r-squared value of the fit is 0.754.
You can obtain a wealth of data about the fits by calling their summary() method, and/or doing dir(wls_fit), if you need a list of every available property and method.

I wrote a concise function to perform the weighted linear regression of a data set, which is a direct translation of GSL's "gsl_fit_wlinear" function. This is useful if you want to know exactly what your function is doing when it performs the fit
def wlinear_fit (x,y,w) :
"""
Fit (x,y,w) to a linear function, using exact formulae for weighted linear
regression. This code was translated from the GNU Scientific Library (GSL),
it is an exact copy of the function gsl_fit_wlinear.
"""
# compute the weighted means and weighted deviations from the means
# wm denotes a "weighted mean", wm(f) = (sum_i w_i f_i) / (sum_i w_i)
W = np.sum(w)
wm_x = np.average(x,weights=w)
wm_y = np.average(y,weights=w)
dx = x-wm_x
dy = y-wm_y
wm_dx2 = np.average(dx**2,weights=w)
wm_dxdy = np.average(dx*dy,weights=w)
# In terms of y = a + b x
b = wm_dxdy / wm_dx2
a = wm_y - wm_x*b
cov_00 = (1.0/W) * (1.0 + wm_x**2/wm_dx2)
cov_11 = 1.0 / (W*wm_dx2)
cov_01 = -wm_x / (W*wm_dx2)
# Compute chi^2 = \sum w_i (y_i - (a + b * x_i))^2
chi2 = np.sum (w * (y-(a+b*x))**2)
return a,b,cov_00,cov_11,cov_01,chi2
To perform your fit, you would do
a,b,cov_00,cov_11,cov_01,chi2 = wlinear_fit(x_list,y_list,1.0/y_err**2)
Which will return the best estimate for the coefficients a (the intercept) and b (the slope) of the linear regression, along with the elements of the covariance matrix cov_00, cov_01 and cov_11. The best estimate on the error on a is then the square root of cov_00 and the one on b is the square root of cov_11. The weighted sum of the residuals is returned in the chi2 variable.
IMPORTANT: this function accepts inverse variances, not the inverse standard deviations as the weights for the data points.

sklearn.linear_model.LinearRegression supports specification of weights during fit:
x_data = np.array(x_list).reshape(-1, 1) # The model expects shape (n_samples, n_features).
y_data = np.array(y_list)
y_err = np.array(y_err)
model = LinearRegression()
model.fit(x_data, y_data, sample_weight=1/y_err)
Here the sample weight is specified as 1 / y_err. Different versions are possible and often it's a good idea to clip these sample weights to a maximum value in case the y_err varies strongly or has small outliers:
sample_weight = 1 / y_err
sample_weight = np.minimum(sample_weight, MAX_WEIGHT)
where MAX_WEIGHT should be determined from your data (by looking at the y_err or 1 / y_err distributions, e.g. if they have outliers they can be clipped).

I found this document helpful in understanding and setting up my own weighted least squares routine (applicable for any programming language).
Typically learning and using optimized routines is the best way to go but there are times where understanding the guts of a routine is important.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.