Train score diminishes after polynomial regression degree increases

Train score diminishes after polynomial regression degree increases - python

I'm trying to use linear regression to fit a polynomium to a set of points from a sinusoidal signal with some noise added, using linear_model.LinearRegression from sklearn.
As expected, the training and validation scores increase as the degree of the polynomium increases, but after some degree around 20 things start getting weird and the scores start going down, and the model returns polynomiums that don't look at all like the data that I use to train it.
Below are some plots where this can be seen, as well as the code that generated both the regression models and the plots:
How the thing works well until degree=17. Original data VS predictions:
After that it just gets worse:
Validation curve, increasing the degree of the polynomium:
from sklearn.pipeline import make_pipeline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.learning_curve import validation_curve
def make_data(N, err=0.1, rseed=1):
rng = np.random.RandomState(1)
x = 10 * rng.rand(N)
X = x[:, None]
y = np.sin(x) + 0.1 * rng.randn(N)
if err > 0:
y += err * rng.randn(N)
return X, y
def PolynomialRegression(degree=4):
return make_pipeline(PolynomialFeatures(degree),
LinearRegression())
X, y = make_data(400)
X_test = np.linspace(0, 10, 500)[:, None]
degrees = np.arange(0, 40)
plt.figure(figsize=(16, 8))
plt.scatter(X.flatten(), y)
for degree in degrees:
y_test = PolynomialRegression(degree).fit(X, y).predict(X_test)
plt.plot(X_test, y_test, label='degre={0}'.format(degree))
plt.title('Original data VS predicted values for different degrees')
plt.legend(loc='best');
degree = np.arange(0, 40)
train_score, val_score = validation_curve(PolynomialRegression(), X, y,
'polynomialfeatures__degree',
degree, cv=7)
plt.figure(figsize=(12, 6))
plt.plot(degree, np.median(train_score, 1), marker='o',
color='blue', label='training score')
plt.plot(degree, np.median(val_score, 1), marker='o',
color='red', label='validation score')
plt.legend(loc='best')
plt.ylim(0, 1)
plt.title('Learning curve, increasing the degree of the polynomium')
plt.xlabel('degree')
plt.ylabel('score');
I know the expected thing is that the validation score goes down when the complexity of the model increases, but why does the training score goes down as well? What can I be missing here?

First of all, here is how you can fix it by setting normalization flag True for the model;
def PolynomialRegression(degree=4):
return make_pipeline(PolynomialFeatures(degree),
LinearRegression(normalize=True))
But why? In linear regression fit() function finds best-fitting model with Moore–Penrose inverse which is a common way to compute least-square solution. When you add polynomials of the values, your augmented features become very large very quickly if you do not normalize. These large values dominate the cost computed by least-square and lead to a model fits to larger values i.e higher order polynomial values instead of the data.
Plots looks better and the way they are supposed to be.

Training score is expected to go down as well due to overfitting of the model on training data. Error on validation goes down due to sine function's Taylor series expansion. So, as you increase degree of polynomial, your model improves to fit the sine curve better.
In ideal scenario if you don't have a function that expands to infinite degrees, you see training error going down (not monotonically but in general) and validation error going up after some degree (high for lower degrees -> low for some higher degree -> increasing after that).

Related

Questions about ridge regression on python : Scaling, and interpretation

I tried to run a Ridge Regression on Boston housing data with python, but I had the following questions that I cannot find answer anywhere so I decided to post it here:
Is scaling recommended before fitting the model? Because I get the same score when I scale and when I don't scale. Also, what is the interpretation of the alpha/coeff graph in terms of choosing the best alpha?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import linear_model
df = pd.read_csv('../housing.data',delim_whitespace=True,header=None)
col_names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
df.columns = col_names
X = df.loc[:,df.columns!='MEDV']
col_X = X.columns
y = df['MEDV'].values
# Feature Scaling:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
clf = Ridge()
coefs = []
alphas = np.logspace(-6, 6, 200)
for a in alphas:
clf.set_params(alpha=a)
clf.fit(X_std, y)
coefs.append(clf.coef_)
plt.figure(figsize=(20, 6))
plt.subplot(121)
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()
Alpha/coefficient graph for scaled X
Alpha/coefficient graph for unscaled X
On the scaled data, when I compute the score and choose the alpha thanks to CV, I get:
from sklearn.linear_model import RidgeCV
clf = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1, 5, 7]).fit(X_std, y)
> clf.score(X_std, y)
> 0.74038
> clf.alpha_
> 5.0
On the non-scaled data, I even get a slightly better score with a completely different alpha:
clf = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1, 6]).fit(X, y)
> clf.score(X, y)
> 0.74064
> clf.alpha_
> 0.01
Thanks for your insights on the matter, looking forward to reading your answers!

I think you should scale because Ridge Regularization penalizes large values, and so you don't want to lose meaningful features because of scaling issues. Perhaps you don't see a difference because the housing data is a toy dataset and is already scaled well.
A larger alpha is a stronger penalty on large values. The graph is showing you (though it has no labeling) that with a stronger alpha you send coefficients to zero more strongly. The more gradual lines are the smaller weights, so they're effected less or almost not at all until alpha becomes sufficiently large. The sharper ones are larger weights, so they drop to zero more quickly. When they do, the feature will disappear from your regression.

For the scaled data, the magnitude of design matrix is smaller, and the coefficients tend to be larger (and more L2 penalty is imposed). To minimize L2, we need more and more small coefficients. How to get more and more small coefficients? The way is to choose a very big alpha, so we can have more smaller coefficients. That is why if you scale the data, the optimal alpha is a great number.

Numpy: how to generate a random noisy curve resembling a "training curve"

I'd like to know how I can generate some random data whose plot resembles a "training curve." By training curve, I mean an array of training loss values from a learning model. These typically have larger values and variance at the beginning, and over time converge to some value with very little variance. It looks a bit like a noisy exponential curve.
This is the closest I've gotten to making random data that resembles a training curve. The problems are that the curve does not flatten out or converge like true loss curves, and there is too much variance on the flatter part.
import numpy as np
import matplotlib.pyplot as plt
num_iters = 2000
rand_curve = np.sort(np.random.exponential(size=num_iters))[::-1]
noise = np.random.normal(0, 0.2, num_iters)
signal = rand_curve + noise
noisy_curve = signal[signal > 0]
plt.plot(noisy_curve, c='r', label='random curve')
And here is an actual training loss curve for reference.
I do not know enough about probability distributions to know if this is a stupid question. I only wanted to generate a random curve so that others had a data array to work with to help me with another question I have about logarithmic plots in matplotlib.

Here is the illustration how to do it with gamma distribution for the noise
x = np.arange(2000)
y = 0.00025 + 0.001 * np.exp(-x/100.) + scipy.stats.gamma(3).rvs(len(x))*(1-np.exp(-x/100))*2e-5
You can adjust the parameters here, to reduce the amount of noise etc

Seems like you could add a dampener to the noise value that is proportional to how far along the x axis that given value is. This would mean, in this case, the variance would decrease the flatter the curve got. Something like:
import numpy as np
import matplotlib.pyplot as plt
num_iters = 2000
rand_curve = np.sort(np.random.exponential(size=num_iters))[::-1]
noise = np.random.normal(0, 0.2, num_iters)
index = 0
for noise_value in np.nditer(noise):
noise[index] = noise_value - index
index = index + 1
signal = rand_curve + noise
noisy_curve = signal[signal > 0]
plt.plot(noisy_curve, c='r', label='random curve')
Thus I think the noise values should be lower the further along X you go and it should achieve the result you want!

Fit one-dimensional data with scikit-learn to predict line

I wrote code with scikit-learn to build a SVR prediction model for one-dimensional toy data and then plot it with matplotlib.
The blue line is the true data. The model with the linear kernel fits a nice line, but for the kernel of degree 2, the predictions are not what I would expect. I would like to have a model that would predict the values of the blue line slightly below what the orange line is predicting. I painted a black line to visualize what I had in mind.
Why is this happening? The data seems a good candidate for a polynomial of degree 2. The black trend line following the true data and then decreasing much later on the right should result in a much better fit than what the green line is providing, if I just look at this plot. Shouldn't such a model be found with a polynomial of degree 2 based on the data? It would also curve nicely at X = 0 close to the blue line, instead of having this curvature with a higher estimated y value there.
How can I achieve a model that I want?
The code below is complete and self contained, run it to get the plot above (minus the painted black line)
# some data
y = [0, 3642, 6414, 9844, 13210, 16072, 18868, 22275, 25551, 28949, 31680, 34412, 37290, 39858, 42557,
45094, 47354, 49547, 51874, 54534, 55987, 55987, 58377, 60767, 63109, 65060, 66865, 68540, 70328,
72035, 73905, 75791, 77873, 79791, 81775, 83726]
X = range(0, len(y))
X_longer = range(0, len(y)*2)
# train models
from sklearn.svm import SVR
import numpy as np
clf_1 = SVR(kernel='poly', C=1e3, degree=1)
clf_2 = SVR(kernel='poly', C=1e3, degree=2)
clf_1.fit(np.array(X).reshape(-1, 1), y)
clf_2.fit(np.array(X).reshape(-1, 1), y)
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
# plot real data
plt.plot(X, y, linewidth=8.0, label='true data')
predicted_1_y = []
predicted_2_y = []
# predict data points based on models
for i in X_longer:
predicted_1_y.append(clf_1.predict(np.array([i]).reshape(-1, 1)))
predicted_2_y.append(clf_2.predict(np.array([i]).reshape(-1, 1)))
# plot model predictions
plt.plot(X_longer, predicted_1_y, linewidth=6.0, ls=":", label='model, degree 1')
plt.plot(X_longer, predicted_2_y, linewidth=6.0, ls=":", label='model, degree 2')
plt.legend(loc='upper left')
plt.show()

This happens because linear and quadratic features will always grow up or down eventually. You would need an operation like square-root or log to pick up decaying feature you want.
A simple way to do this is to transform the input signal before fitting. For example, assume a square-root trend:
X = np.array(X)[:,None]**2
clf = SVR(kernel='linear').fit(X, y)
For more general use-cases, where you really don't know the trend you want, or don't want to assume a particular transformation like this, you might try a regression tool like Eureqa to compute the best transformation and mathematical model possible.

sklearn LogisticRegression - plot displays too small coefficient

I am attempting to fit a logistic regression model to sklearn's iris dataset. I get a probability curve that looks like it is too flat, aka the coefficient is too small. I would expect a probability over ninety percent by sepal length > 7 :
Is this probability curve indeed wrong? If so, what might cause that in my code?
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
import math
from sklearn.linear_model import LogisticRegression
data = datasets.load_iris()
#get relevent data
lengths = data.data[:100, :1]
is_setosa = data.target[:100]
#fit model
lgs = LogisticRegression()
lgs.fit(lengths, is_setosa)
m = lgs.coef_[0,0]
b = lgs.intercept_[0]
#generate values for curve overlay
lgs_curve = lambda x: 1/(1 + math.e**(-(m*x+b)))
x_values = np.linspace(2, 10, 100)
y_values = lgs_curve(x_values)
#plot it
plt.plot(x_values, y_values)
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")

If you refer to http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression, you will find a regularization parameter C that can be passed as argument while training the logistic regression model.
C : float, default: 1.0 Inverse of regularization strength; must be a
positive float. Like in support vector machines, smaller values
specify stronger regularization.
Now, if you try different values of this regularization parameter, you will find that larger values of C leads to fitting curves that has sharper transitions from 0 to 1 value of the output (response) binary variable, and still larger values fit models that have high variance (try to model the training data transition more closely, i think that's what you are expecting, then you may try to set C value as high as 10 and plot) but at the same time are likely to have the risk to overfit, while the default value C=1 and values smaller than that lead to high bias and are likely to underfit and here comes the famous bias-variance trade-off in machine learning.
You can always use techniques like cross-validation to choose the C value that is right for you. The following code / figure shows the probability curve fitted with models of different complexity (i.e., with different values of the regularization parameter C, from 1 to 10):
x_values = np.linspace(2, 10, 100)
x_test = np.reshape(x_values, (100,1))
C = list(range(1, 11))
labels = map(str, C)
for i in range(len(C)):
lgs = LogisticRegression(C = C[i]) # pass a value for the regularization parameter C
lgs.fit(lengths, is_setosa)
y_values = lgs.predict_proba(x_test)[:,1] # use this function to compute probability directly
plt.plot(x_values, y_values, label=labels[i])
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")
plt.legend()
plt.show()
Predicted probs with models fitted with different values of C

Although you do not describe what you want to plot, I assume you want to plot the separating line. It seems that you are confused with respect to the Logistic/sigmoid function. The decision function of Logistic Regression is a line.

Your probability graph looks flat because you have, in a sense, "zoomed in" too much.
If you look at the middle of a sigmoid function, it get's to be almost linear, as the second derivative get's to be almost 0 (see for example a wolfram alpha graph)
Please note that the value's we are talking about are the results of -(m*x+b)
When we reduce the limits of your graph, say by using
x_values = np.linspace(4, 7, 100), we get something which looks like a line:
But on the other hand, if we go crazy with the limits, say by using x_values = np.linspace(-10, 20, 100), we get the clearer sigmoid:

How to do linear regression, taking errorbars into account?

I am doing a computer simulation for some physical system of finite size, and after this I am doing extrapolation to the infinity (Thermodynamic limit). Some theory says that data should scale linearly with system size, so I am doing linear regression.
The data I have is noisy, but for each data point I can estimate errorbars. So, for example data points looks like:
x_list = [0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587, 0.22360679774997896, 0.20412414523193154, 0.2, 0.16666666666666666]
y_list = [0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715, 0.11167239583333334, 0.10876248333333333, 0.09814170444444444, 0.08560799305555555]
y_err = [0.003306749165349316, 0.003818446389148108, 0.0056036878203831785, 0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223, 0.002981084130692832, 0.0034913019065973983]
Let's say I am trying to do this in Python.
First way that I know is:
m, c, r_value, p_value, std_err = scipy.stats.linregress(x_list, y_list)
I understand this gives me errorbars of the result, but this does not take into account errorbars of the initial data.
Second way that I know is:
m, c = numpy.polynomial.polynomial.polyfit(x_list, y_list, 1, w = [1.0 / ty for ty in y_err], full=False)
Here we use the inverse of the errorbar for the each point as a weight that is used in the least square approximation. So if a point is not really that reliable it will not influence result a lot, which is reasonable.
But I can not figure out how to get something that combines both these methods.
What I really want is what second method does, meaning use regression when every point influences the result with different weight. But at the same time I want to know how accurate my result is, meaning, I want to know what are errorbars of the resulting coefficients.
How can I do this?

Not entirely sure if this is what you mean, but…using pandas, statsmodels, and patsy, we can compare an ordinary least-squares fit and a weighted least-squares fit which uses the inverse of the noise you provided as a weight matrix (statsmodels will complain about sample sizes < 20, by the way).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300
import statsmodels.formula.api as sm
x_list = [0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587, 0.22360679774997896, 0.20412414523193154, 0.2, 0.16666666666666666]
y_list = [0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715, 0.11167239583333334, 0.10876248333333333, 0.09814170444444444, 0.08560799305555555]
y_err = [0.003306749165349316, 0.003818446389148108, 0.0056036878203831785, 0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223, 0.002981084130692832, 0.0034913019065973983]
# put x and y into a pandas DataFrame, and the weights into a Series
ws = pd.DataFrame({
'x': x_list,
'y': y_list
})
weights = pd.Series(y_err)
wls_fit = sm.wls('x ~ y', data=ws, weights=1 / weights).fit()
ols_fit = sm.ols('x ~ y', data=ws).fit()
# show the fit summary by calling wls_fit.summary()
# wls fit r-squared is 0.754
# ols fit r-squared is 0.701
# let's plot our data
plt.clf()
fig = plt.figure()
ax = fig.add_subplot(111, facecolor='w')
ws.plot(
kind='scatter',
x='x',
y='y',
style='o',
alpha=1.,
ax=ax,
title='x vs y scatter',
edgecolor='#ff8300',
s=40
)
# weighted prediction
wp, = ax.plot(
wls_fit.predict(),
ws['y'],
color='#e55ea2',
lw=1.,
alpha=1.0,
)
# unweighted prediction
op, = ax.plot(
ols_fit.predict(),
ws['y'],
color='k',
ls='solid',
lw=1,
alpha=1.0,
)
leg = plt.legend(
(op, wp),
('Ordinary Least Squares', 'Weighted Least Squares'),
loc='upper left',
fontsize=8)
plt.tight_layout()
fig.set_size_inches(6.40, 5.12)
plt.show()
WLS residuals:
[0.025624005084707302,
0.013611438189866154,
-0.033569595462217161,
0.044110895217014695,
-0.025071632845910546,
-0.036308252199571928,
-0.010335514810672464,
-0.0081511479431851663]
The mean squared error of the residuals for the weighted fit (wls_fit.mse_resid or wls_fit.scale) is 0.22964802498892287, and the r-squared value of the fit is 0.754.
You can obtain a wealth of data about the fits by calling their summary() method, and/or doing dir(wls_fit), if you need a list of every available property and method.

I wrote a concise function to perform the weighted linear regression of a data set, which is a direct translation of GSL's "gsl_fit_wlinear" function. This is useful if you want to know exactly what your function is doing when it performs the fit
def wlinear_fit (x,y,w) :
"""
Fit (x,y,w) to a linear function, using exact formulae for weighted linear
regression. This code was translated from the GNU Scientific Library (GSL),
it is an exact copy of the function gsl_fit_wlinear.
"""
# compute the weighted means and weighted deviations from the means
# wm denotes a "weighted mean", wm(f) = (sum_i w_i f_i) / (sum_i w_i)
W = np.sum(w)
wm_x = np.average(x,weights=w)
wm_y = np.average(y,weights=w)
dx = x-wm_x
dy = y-wm_y
wm_dx2 = np.average(dx**2,weights=w)
wm_dxdy = np.average(dx*dy,weights=w)
# In terms of y = a + b x
b = wm_dxdy / wm_dx2
a = wm_y - wm_x*b
cov_00 = (1.0/W) * (1.0 + wm_x**2/wm_dx2)
cov_11 = 1.0 / (W*wm_dx2)
cov_01 = -wm_x / (W*wm_dx2)
# Compute chi^2 = \sum w_i (y_i - (a + b * x_i))^2
chi2 = np.sum (w * (y-(a+b*x))**2)
return a,b,cov_00,cov_11,cov_01,chi2
To perform your fit, you would do
a,b,cov_00,cov_11,cov_01,chi2 = wlinear_fit(x_list,y_list,1.0/y_err**2)
Which will return the best estimate for the coefficients a (the intercept) and b (the slope) of the linear regression, along with the elements of the covariance matrix cov_00, cov_01 and cov_11. The best estimate on the error on a is then the square root of cov_00 and the one on b is the square root of cov_11. The weighted sum of the residuals is returned in the chi2 variable.
IMPORTANT: this function accepts inverse variances, not the inverse standard deviations as the weights for the data points.

sklearn.linear_model.LinearRegression supports specification of weights during fit:
x_data = np.array(x_list).reshape(-1, 1) # The model expects shape (n_samples, n_features).
y_data = np.array(y_list)
y_err = np.array(y_err)
model = LinearRegression()
model.fit(x_data, y_data, sample_weight=1/y_err)
Here the sample weight is specified as 1 / y_err. Different versions are possible and often it's a good idea to clip these sample weights to a maximum value in case the y_err varies strongly or has small outliers:
sample_weight = 1 / y_err
sample_weight = np.minimum(sample_weight, MAX_WEIGHT)
where MAX_WEIGHT should be determined from your data (by looking at the y_err or 1 / y_err distributions, e.g. if they have outliers they can be clipped).

I found this document helpful in understanding and setting up my own weighted least squares routine (applicable for any programming language).
Typically learning and using optimized routines is the best way to go but there are times where understanding the guts of a routine is important.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.