I wrote code with scikit-learn to build an SVR prediction model for one-dimensional toy data and then plotted it with matplotlib.
The blue line is the true data. The model with the linear kernel fits a nice line, but for the kernel of degree 2, the predictions are not what I would expect. I would like to have a model that would predict the values of the blue line slightly below what the orange line is predicting. I painted a black line to visualize what I had in mind.
Why is this happening? The data seems like a good candidate for a polynomial of degree 2. The black trend line, which follows the true data and only decreases much later on the right, should give a much better fit than what the green line provides, if I just look at this plot. Shouldn't such a model be found with a polynomial of degree 2 based on the data? It would also curve nicely at X = 0, close to the blue line, instead of having this curvature with a higher estimated y value there.
How can I achieve the model that I want?
The code below is complete and self-contained; run it to get the plot above (minus the painted black line).
# some data
y = [0, 3642, 6414, 9844, 13210, 16072, 18868, 22275, 25551, 28949, 31680, 34412, 37290, 39858, 42557,
45094, 47354, 49547, 51874, 54534, 55987, 55987, 58377, 60767, 63109, 65060, 66865, 68540, 70328,
72035, 73905, 75791, 77873, 79791, 81775, 83726]
X = range(0, len(y))
X_longer = range(0, len(y)*2)
# train models
from sklearn.svm import SVR
import numpy as np
clf_1 = SVR(kernel='poly', C=1e3, degree=1)
clf_2 = SVR(kernel='poly', C=1e3, degree=2)
clf_1.fit(np.array(X).reshape(-1, 1), y)
clf_2.fit(np.array(X).reshape(-1, 1), y)
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
# plot real data
plt.plot(X, y, linewidth=8.0, label='true data')
predicted_1_y = []
predicted_2_y = []
# predict data points based on models
for i in X_longer:
    predicted_1_y.append(clf_1.predict(np.array([i]).reshape(-1, 1)))
    predicted_2_y.append(clf_2.predict(np.array([i]).reshape(-1, 1)))
# plot model predictions
plt.plot(X_longer, predicted_1_y, linewidth=6.0, ls=":", label='model, degree 1')
plt.plot(X_longer, predicted_2_y, linewidth=6.0, ls=":", label='model, degree 2')
plt.legend(loc='upper left')
plt.show()
This happens because linear and quadratic features always end up growing up or down without bound. You would need an operation like a square root or a logarithm to capture the flattening trend you want.
A simple way to do this is to transform the input signal before fitting. For example, assuming a square-root trend:
X_sqrt = np.array(X, dtype=float)[:, None] ** 0.5
clf = SVR(kernel='linear').fit(X_sqrt, y)
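To predict on the extended range, apply the same transform to the new inputs before calling predict. A minimal sketch, reusing X, y and X_longer from the question and the clf fitted above:
import numpy as np
import matplotlib.pyplot as plt

# apply the identical square-root transform to the points we want to predict on
X_longer_sqrt = np.array(X_longer, dtype=float)[:, None] ** 0.5
predicted_y = clf.predict(X_longer_sqrt)

plt.plot(X, y, linewidth=8.0, label='true data')
plt.plot(X_longer, predicted_y, ls=':', linewidth=6.0, label='SVR on sqrt-transformed X')
plt.legend(loc='upper left')
plt.show()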
For more general use-cases, where you really don't know the trend you want, or don't want to assume a particular transformation like this, you might try a regression tool like Eureqa to compute the best transformation and mathematical model possible.
I tried to run a ridge regression on the Boston housing data with Python, but I have the following questions that I cannot find answers to anywhere, so I decided to post them here:
Is scaling recommended before fitting the model? I ask because I get the same score whether I scale or not. Also, what is the interpretation of the alpha/coefficient graph in terms of choosing the best alpha?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn import linear_model
df = pd.read_csv('../housing.data',delim_whitespace=True,header=None)
col_names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
df.columns = col_names
X = df.loc[:,df.columns!='MEDV']
col_X = X.columns
y = df['MEDV'].values
# Feature Scaling:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
clf = Ridge()
coefs = []
alphas = np.logspace(-6, 6, 200)
for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X_std, y)
    coefs.append(clf.coef_)
plt.figure(figsize=(20, 6))
plt.subplot(121)
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()
Alpha/coefficient graph for scaled X
Alpha/coefficient graph for unscaled X
On the scaled data, when I compute the score and choose the alpha via CV, I get:
from sklearn.linear_model import RidgeCV
clf = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1, 5, 7]).fit(X_std, y)
> clf.score(X_std, y)
> 0.74038
> clf.alpha_
> 5.0
On the non-scaled data, I even get a slightly better score with a completely different alpha:
clf = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1, 6]).fit(X, y)
> clf.score(X, y)
> 0.74064
> clf.alpha_
> 0.01
Thanks for your insights on the matter, looking forward to reading your answers!
I think you should scale, because ridge regularization penalizes large coefficient values, and you don't want to lose meaningful features just because of scaling issues. Perhaps you don't see a difference because the housing data is a toy dataset and is already reasonably well scaled.
A larger alpha is a stronger penalty on large values. The graph is showing you (though it has no labeling) that with a stronger alpha you push coefficients toward zero more strongly. The more gradual lines are the smaller weights, so they are affected less, or almost not at all, until alpha becomes sufficiently large. The sharper ones are larger weights, so they drop toward zero more quickly. Once a coefficient reaches zero, that feature effectively disappears from your regression.
For the scaled data, the entries of the design matrix are smaller in magnitude, so the fitted coefficients have to be larger to produce the same predictions, and the L2 penalty term therefore sees larger values. To shrink those larger coefficients by a comparable amount, a much bigger alpha is needed. That is why, if you scale the data, the optimal alpha comes out as a large number.
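If you want the scaling and the CV-based alpha selection to travel together, one option is to wrap both in a Pipeline; a minimal sketch, reusing X and y from above (the alpha grid here is an arbitrary choice):
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

# scaling is part of the model, so it is applied consistently wherever the model is used
pipe = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13)))
pipe.fit(X, y)

print(pipe.score(X, y))                    # R^2 on the training data
print(pipe.named_steps['ridgecv'].alpha_)  # alpha chosen by cross-validation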
I'd like to know how I can generate some random data whose plot resembles a "training curve." By training curve, I mean an array of training loss values from a learning model. These typically have larger values and variance at the beginning, and over time converge to some value with very little variance. It looks a bit like a noisy exponential curve.
This is the closest I've gotten to making random data that resembles a training curve. The problems are that the curve does not flatten out or converge like true loss curves, and there is too much variance on the flatter part.
import numpy as np
import matplotlib.pyplot as plt
num_iters = 2000
rand_curve = np.sort(np.random.exponential(size=num_iters))[::-1]
noise = np.random.normal(0, 0.2, num_iters)
signal = rand_curve + noise
noisy_curve = signal[signal > 0]
plt.plot(noisy_curve, c='r', label='random curve')
And here is an actual training loss curve for reference.
I do not know enough about probability distributions to know if this is a stupid question. I only wanted to generate a random curve so that others had a data array to work with to help me with another question I have about logarithmic plots in matplotlib.
Here is an illustration of how to do it, using a gamma distribution for the noise:
import numpy as np
import scipy.stats
x = np.arange(2000)
y = 0.00025 + 0.001 * np.exp(-x/100.) + scipy.stats.gamma(3).rvs(len(x))*(1-np.exp(-x/100))*2e-5
You can adjust the parameters here to reduce the amount of noise, etc.
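A quick plot of the result, assuming x and y from above:
import matplotlib.pyplot as plt

plt.plot(x, y, c='r', label='synthetic loss curve')
plt.legend()
plt.show()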
Seems like you could add a dampener to each noise value that is proportional to how far along the x-axis that value is. This would mean, in this case, that the variance decreases as the curve flattens. Something like:
import numpy as np
import matplotlib.pyplot as plt
num_iters = 2000
rand_curve = np.sort(np.random.exponential(size=num_iters))[::-1]
noise = np.random.normal(0, 0.2, num_iters)
index = 0
for noise_value in np.nditer(noise):
    # dampen each noise value in proportion to how far along the curve we are
    noise[index] = noise_value * (num_iters - index) / num_iters
    index = index + 1
signal = rand_curve + noise
noisy_curve = signal[signal > 0]
plt.plot(noisy_curve, c='r', label='random curve')
Thus I think the noise values should be lower the further along X you go and it should achieve the result you want!
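A vectorized sketch of the same idea, scaling the noise by a factor that falls linearly with position (the linear schedule is just one possible choice):
import numpy as np
import matplotlib.pyplot as plt

num_iters = 2000
rand_curve = np.sort(np.random.exponential(size=num_iters))[::-1]

# damping factor falls linearly from 1 to 0 across the run, so late noise is small
damping = 1.0 - np.arange(num_iters) / num_iters
noise = np.random.normal(0, 0.2, num_iters) * damping

noisy_curve = np.clip(rand_curve + noise, 0, None)
plt.plot(noisy_curve, c='r', label='random curve')
plt.legend()
plt.show()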
I'm trying to use linear regression to fit a polynomial to a set of points from a sinusoidal signal with some noise added, using linear_model.LinearRegression from sklearn.
As expected, the training and validation scores increase as the degree of the polynomial increases, but after some degree around 20 things start getting weird: the scores start going down, and the model returns polynomials that don't look at all like the data I used to train it.
Below are some plots where this can be seen, as well as the code that generated both the regression models and the plots:
The fit works well until degree=17. Original data vs. predictions:
After that it just gets worse:
Validation curve, increasing the degree of the polynomial:
from sklearn.pipeline import make_pipeline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import validation_curve
def make_data(N, err=0.1, rseed=1):
    rng = np.random.RandomState(rseed)
    x = 10 * rng.rand(N)
    X = x[:, None]
    y = np.sin(x) + 0.1 * rng.randn(N)
    if err > 0:
        y += err * rng.randn(N)
    return X, y

def PolynomialRegression(degree=4):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression())
X, y = make_data(400)
X_test = np.linspace(0, 10, 500)[:, None]
degrees = np.arange(0, 40)
plt.figure(figsize=(16, 8))
plt.scatter(X.flatten(), y)
for degree in degrees:
    y_test = PolynomialRegression(degree).fit(X, y).predict(X_test)
    plt.plot(X_test, y_test, label='degree={0}'.format(degree))
plt.title('Original data VS predicted values for different degrees')
plt.legend(loc='best');
degree = np.arange(0, 40)
train_score, val_score = validation_curve(PolynomialRegression(), X, y,
                                          param_name='polynomialfeatures__degree',
                                          param_range=degree, cv=7)
plt.figure(figsize=(12, 6))
plt.plot(degree, np.median(train_score, 1), marker='o',
color='blue', label='training score')
plt.plot(degree, np.median(val_score, 1), marker='o',
color='red', label='validation score')
plt.legend(loc='best')
plt.ylim(0, 1)
plt.title('Learning curve, increasing the degree of the polynomium')
plt.xlabel('degree')
plt.ylabel('score');
I know the expected thing is that the validation score goes down when the complexity of the model increases, but why does the training score go down as well? What could I be missing here?
First of all, here is how you can fix it, by setting the normalization flag to True for the model:
def PolynomialRegression(degree=4):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(normalize=True))
But why? In linear regression, the fit() function finds the best-fitting model using the Moore–Penrose pseudoinverse, which is a common way to compute the least-squares solution. When you add powers of the input values, the augmented features become very large very quickly if you do not normalize. These large values dominate the cost computed by least squares and lead to a model that fits the larger values, i.e. the higher-order polynomial features, instead of the data.
The plots now look better, the way they are supposed to.
The training score is expected to go down as well, due to the model overfitting the training data. The validation error goes down thanks to the sine function's Taylor series expansion: as you increase the degree of the polynomial, your model fits the sine curve better and better.
In an ideal scenario, if you don't have a function that expands to infinitely many degrees, you would see the training error going down (not monotonically, but in general) and the validation error going up after some degree (high for lower degrees -> low for some higher degree -> increasing after that).
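Note that the normalize argument was deprecated in scikit-learn 1.0 and removed in 1.2. On recent versions, a comparable effect can be obtained by scaling the polynomial features explicitly inside the pipeline; a minimal sketch:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

def PolynomialRegression(degree=4):
    # scale the polynomial features so high-degree columns don't dominate the least-squares solve
    return make_pipeline(PolynomialFeatures(degree),
                         StandardScaler(),
                         LinearRegression())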
I am attempting to fit a logistic regression model to sklearn's iris dataset. I get a probability curve that looks like it is too flat, i.e. the coefficient is too small. I would expect a probability of over ninety percent for sepal length > 7:
Is this probability curve indeed wrong? If so, what might cause that in my code?
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
import math
from sklearn.linear_model import LogisticRegression
data = datasets.load_iris()
#get relevent data
lengths = data.data[:100, :1]
is_setosa = data.target[:100]
#fit model
lgs = LogisticRegression()
lgs.fit(lengths, is_setosa)
m = lgs.coef_[0,0]
b = lgs.intercept_[0]
#generate values for curve overlay
lgs_curve = lambda x: 1/(1 + math.e**(-(m*x+b)))
x_values = np.linspace(2, 10, 100)
y_values = lgs_curve(x_values)
#plot it
plt.plot(x_values, y_values)
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")
If you refer to http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression, you will find a regularization parameter C that can be passed as an argument while training the logistic regression model.
C : float, default: 1.0 Inverse of regularization strength; must be a
positive float. Like in support vector machines, smaller values
specify stronger regularization.
Now, if you try different values of this regularization parameter, you will find that larger values of C lead to fitted curves with sharper transitions from 0 to 1 in the output (response) binary variable. Still larger values fit models with higher variance, which model the training-data transition more closely; I think that is what you are expecting, so you may try setting C as high as 10 and plotting. At the same time, those models run the risk of overfitting, while the default value C=1 and values smaller than that lead to high bias and are likely to underfit. Here comes the famous bias-variance trade-off in machine learning.
You can always use techniques like cross-validation to choose the C value that is right for you. The following code / figure shows the probability curve fitted with models of different complexity (i.e., with different values of the regularization parameter C, from 1 to 10):
x_values = np.linspace(2, 10, 100)
x_test = np.reshape(x_values, (100,1))
C = list(range(1, 11))
labels = list(map(str, C))  # list() so that labels can be indexed in Python 3
for i in range(len(C)):
    lgs = LogisticRegression(C=C[i])  # pass a value for the regularization parameter C
    lgs.fit(lengths, is_setosa)
    y_values = lgs.predict_proba(x_test)[:, 1]  # use this function to compute probabilities directly
    plt.plot(x_values, y_values, label=labels[i])
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")
plt.legend()
plt.show()
Predicted probs with models fitted with different values of C
Although you do not describe what you want to plot, I assume you want to plot the separating line. It seems that you are confused with respect to the Logistic/sigmoid function. The decision function of Logistic Regression is a line.
Your probability graph looks flat because you have, in a sense, "zoomed in" too much.
If you look at the middle of a sigmoid function, it gets to be almost linear, as the second derivative gets to be almost 0 (see for example a Wolfram Alpha graph).
Please note that the values we are talking about are the results of -(m*x + b).
When we reduce the limits of your graph, say by using
x_values = np.linspace(4, 7, 100), we get something which looks like a line:
But on the other hand, if we go crazy with the limits, say by using x_values = np.linspace(-10, 20, 100), we get the clearer sigmoid:
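A small sketch of the two views, reusing lgs_curve from the question and changing only the plotting range:
import numpy as np
import matplotlib.pyplot as plt

# same fitted model, two different x ranges: zoomed in it looks almost linear, zoomed out it is clearly a sigmoid
for lo, hi in [(4, 7), (-10, 20)]:
    xs = np.linspace(lo, hi, 100)
    plt.figure()
    plt.plot(xs, lgs_curve(xs))
    plt.title('x in [{}, {}]'.format(lo, hi))
    plt.show()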
How to fit a locally weighted regression in python so that it can be used to predict on new data?
There is statsmodels.nonparametric.smoothers_lowess.lowess, but it returns the estimates only for the original data set; so it seems to only do fit and predict together, rather than separately as I expected.
scikit-learn estimators always have a fit method that allows the object to be used later on new data with predict, but scikit-learn doesn't implement lowess.
Lowess works great for predicting (when combined with interpolation)! I think the code is pretty straightforward-- let me know if you have any questions!
Matplotlib figure
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.interpolate import interp1d
import statsmodels.api as sm
# introduce some floats in our x-values
x = list(range(3, 33)) + [3.2, 6.2]
y = [1,2,1,2,1,1,3,4,5,4,5,6,5,6,7,8,9,10,11,11,12,11,11,10,12,11,11,10,9,8,2,13]
# lowess will return our "smoothed" data with a y value for every x-value
lowess = sm.nonparametric.lowess(y, x, frac=.3)
# unpack the lowess smoothed points to their values
lowess_x = list(zip(*lowess))[0]
lowess_y = list(zip(*lowess))[1]
# run scipy's interpolation. There is also extrapolation I believe
f = interp1d(lowess_x, lowess_y, bounds_error=False)
xnew = [i/10. for i in range(400)]
# this generates y values for our x-values using our interpolator
# it will MISS values outside of the x window (less than 3, greater than 33)
# There might be a better approach, but you can run a for loop
# and if the value is out of the range, use f(min(lowess_x)) or f(max(lowess_x))
ynew = f(xnew)
plt.plot(x, y, 'o')
plt.plot(lowess_x, lowess_y, '*')
plt.plot(xnew, ynew, '-')
plt.show()
I've created a module called moepy that provides an sklearn-like API for a LOWESS model (incl. fit/predict). This enables predictions to be made using the underlying local regression models, rather than the interpolation method described in the other answers. A minimalist example is shown below.
# Imports
import numpy as np
import matplotlib.pyplot as plt
from moepy import lowess
# Data generation
x = np.linspace(0, 5, num=150)
y = np.sin(x) + (np.random.normal(size=len(x)))/10
# Model fitting
lowess_model = lowess.Lowess()
lowess_model.fit(x, y)
# Model prediction
x_pred = np.linspace(0, 5, 26)
y_pred = lowess_model.predict(x_pred)
# Plotting
plt.plot(x_pred, y_pred, '--', label='LOWESS', color='k', zorder=3)
plt.scatter(x, y, label='Noisy Sin Wave', color='C1', s=5, zorder=1)
plt.legend(frameon=False)
A more detailed guide on how to use the model (as well as its confidence and prediction interval variants) can be found here.
Consider using kernel regression instead.
statsmodels has an implementation.
If you have too many data points, why not use scikit-learn's RadiusNeighborsRegressor and specify a tricube weighting function?
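For example, a minimal kernel-regression sketch with statsmodels' KernelReg on synthetic data (default bandwidth selection assumed):
import numpy as np
from statsmodels.nonparametric.kernel_regression import KernelReg

x = np.linspace(0, 10, 200)
y = np.sin(x) + 0.2 * np.random.randn(len(x))

# var_type='c' marks the single regressor as continuous
kr = KernelReg(endog=y, exog=x, var_type='c')
x_new = np.linspace(0, 10, 50)
y_pred, _ = kr.fit(x_new)  # evaluates the fitted regression at the new points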
It's not clear whether it's a good idea to have a dedicated LOESS object with separate fit/predict methods like what is commonly found in Scikit-Learn. By contrast, for neural networks, you could have an object which stores only a relatively small set of weights. The fit method would then optimize the "few" weights by using a very large training dataset. The predict method only needs the weights to make new predictions, and not the entire training set.
Predictions based on LOESS and nearest neighbors, on the other hand, need the entire training set to make new predictions. The only thing a fit method could do is store the training set in the object for later use. If x and y are the training data, and x0 are the points at which to make new predictions, this object-oriented fit/predict solution would look something like the following:
model = Loess()
model.fit(x, y) # No calculations. Just store x and y in model.
y0 = model.predict(x0) # Uses x and y just stored.
By comparison, in my localreg library, I opted for simplicity:
y0 = localreg(x, y, x0)
It really comes down to design choices, as the performance would be the same.
One advantage of the fit/predict approach is that you could have a unified interface like they do in Scikit-Learn, where one model could easily be swapped for another. The fit/predict approach also encourages a machine-learning way of thinking about it, but in that sense LOESS is not very efficient, since it requires storing and using all the data for every new prediction. The latter approach leans more towards the origins of LOESS as a scatterplot smoothing algorithm, which is how I prefer to think about it. This might also shed some light on why statsmodels does it the way it does.
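For completeness, a minimal end-to-end sketch of that call on synthetic data (default degree and kernel assumed):
import numpy as np
from localreg import localreg

x = np.linspace(0, 10, 200)
y = np.sin(x) + 0.2 * np.random.randn(len(x))

# evaluate the local regression at new points in one call
x0 = np.linspace(0, 10, 50)
y0 = localreg(x, y, x0)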
Check out the loess class in scikit-misc. The fitted object has a predict method:
from skmisc.loess import loess

loess_fit = loess(x, y, span=.01)
loess_fit.fit()
preds = loess_fit.predict(x_new).values
https://has2k1.github.io/scikit-misc/stable/generated/skmisc.loess.loess.html