Predicting on new data using locally weighted regression (LOESS/LOWESS)

How can I fit a locally weighted regression in Python so that it can be used to predict on new data?
There is statsmodels.nonparametric.smoothers_lowess.lowess, but it returns the estimates only for the original data set; so it seems to only do fit and predict together, rather than separately as I expected.
scikit-learn always has a fit method that allows the object to be used later on new data with predict; but it doesn't implement lowess.

Lowess works great for predicting (when combined with interpolation)! I think the code is pretty straightforward-- let me know if you have any questions!
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.interpolate import interp1d
import statsmodels.api as sm
# introduce some floats in our x-values
x = list(range(3, 33)) + [3.2, 6.2]
y = [1,2,1,2,1,1,3,4,5,4,5,6,5,6,7,8,9,10,11,11,12,11,11,10,12,11,11,10,9,8,2,13]
# lowess returns our "smoothed" data: one (x, y) pair for every input x-value, sorted by x
lowess = sm.nonparametric.lowess(y, x, frac=.3)
# unpack the lowess-smoothed points into their x and y values
lowess_x = list(zip(*lowess))[0]
lowess_y = list(zip(*lowess))[1]
# run scipy's interpolation; interp1d can also extrapolate with fill_value="extrapolate"
f = interp1d(lowess_x, lowess_y, bounds_error=False)
xnew = [i/10. for i in range(400)]
# this generates y-values for our new x-values using the interpolator
# it returns NaN for values outside of the x window (less than 3, greater than 32)
# There might be a better approach, but you can run a for loop
# and, if the value is out of range, use f(min(lowess_x)) or f(max(lowess_x))
ynew = f(xnew)
plt.plot(x, y, 'o')
plt.plot(lowess_x, lowess_y, '*')
plt.plot(xnew, ynew, '-')
plt.show()
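For the out-of-range values mentioned in the comments above, one option that avoids the for loop is NumPy's np.interp, which holds the first and last values constant outside the fitted range. This is only a minimal sketch: smoothed_x and smoothed_y are made-up stand-ins for the lowess output, and they are assumed to be sorted by x.

```python
import numpy as np

# Hypothetical stand-ins for the lowess output (must be sorted by x).
smoothed_x = np.array([3.0, 4.0, 5.0, 6.0])
smoothed_y = np.array([1.0, 1.5, 2.5, 4.0])

xnew = np.array([0.0, 3.5, 5.5, 10.0])
# np.interp interpolates linearly inside the range and, by default,
# returns the first/last y-value for points to the left/right of it.
ynew = np.interp(xnew, smoothed_x, smoothed_y)
print(ynew)
```

Whether clamping to the end values is sensible depends on the data; it is a flat extrapolation, not a prediction from the local regression itself.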

I've created a module called moepy that provides an sklearn-like API for a LOWESS model (incl. fit/predict). This enables predictions to be made using the underlying local regression models, rather than the interpolation method described in the other answers. A minimalist example is shown below.
# Imports
import numpy as np
import matplotlib.pyplot as plt
from moepy import lowess
# Data generation
x = np.linspace(0, 5, num=150)
y = np.sin(x) + (np.random.normal(size=len(x)))/10
# Model fitting
lowess_model = lowess.Lowess()
lowess_model.fit(x, y)
# Model prediction
x_pred = np.linspace(0, 5, 26)
y_pred = lowess_model.predict(x_pred)
# Plotting
plt.plot(x_pred, y_pred, '--', label='LOWESS', color='k', zorder=3)
plt.scatter(x, y, label='Noisy Sin Wave', color='C1', s=5, zorder=1)
plt.legend(frameon=False)
A more detailed guide on how to use the model (as well as its confidence and prediction interval variants) can be found here.

Consider using Kernel Regression instead.
statsmodels has an implementation.
If you have too many data points, why not use scikit-learn's RadiusNeighborsRegressor and specify a tricube weighting function?
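To make the kernel regression suggestion concrete, here is a rough NumPy-only sketch of a Nadaraya-Watson estimator with a Gaussian kernel. It is not the statsmodels or scikit-learn implementation; the function name kernel_regression and the bandwidth value are assumptions chosen for illustration.

```python
import numpy as np

def kernel_regression(x, y, x0, bandwidth=0.5):
    """Nadaraya-Watson estimate at x0 using a Gaussian kernel."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x0 = np.atleast_1d(np.asarray(x0, float))
    # weights[i, j] = kernel weight of training point j for query point i
    weights = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / bandwidth) ** 2)
    # each prediction is a weighted average of the training y-values
    return (weights @ y) / weights.sum(axis=1)

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 150)
y = np.sin(x) + rng.normal(size=x.size) / 10
y0 = kernel_regression(x, y, [1.0, 2.5, 4.0])
print(y0)
```

Like LOESS, this needs the full training set at prediction time; the bandwidth plays the role that frac plays in lowess.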

It's not clear whether it's a good idea to have a dedicated LOESS object with separate fit/predict methods like what is commonly found in Scikit-Learn. By contrast, for neural networks, you could have an object which stores only a relatively small set of weights. The fit method would then optimize the "few" weights by using a very large training dataset. The predict method only needs the weights to make new predictions, and not the entire training set.
Predictions based on LOESS and nearest neighbors, on the other hand, need the entire training set to make new predictions. The only thing a fit method could do is store the training set in the object for later use. If x and y are the training data, and x0 are the points at which to make new predictions, this object-oriented fit/predict solution would look something like the following:
model = Loess()
model.fit(x, y) # No calculations. Just store x and y in model.
y0 = model.predict(x0) # Uses x and y just stored.
By comparison, in my localreg library, I opted for simplicity:
y0 = localreg(x, y, x0)
It really comes down to design choices, as the performance would be the same.
One advantage of the fit/predict approach is that you could have a unified interface like the one in Scikit-Learn, where one model can easily be swapped for another. The fit/predict approach also encourages a machine learning way of thinking about it, but in that sense LOESS is not very efficient, since it requires storing and using all the data for every new prediction. The latter approach leans more towards the origins of LOESS as a scatterplot smoothing algorithm, which is how I prefer to think about it. This might also shed some light on why statsmodels does it the way it does.
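To illustrate the point that fit only has to store the data, here is a minimal sketch of a local linear regression with tricube weights behind a fit/predict interface. It is neither the localreg nor the statsmodels implementation; the class name SimpleLoess and the frac parameter are assumptions for illustration, there are no robustness iterations, and the x-values are assumed to be distinct.

```python
import numpy as np

class SimpleLoess:
    """Hypothetical minimal LOESS: local linear fits with tricube weights."""

    def __init__(self, frac=0.5):
        self.frac = frac  # fraction of the training points used per local fit

    def fit(self, x, y):
        # No calculations here: just store the training data.
        self.x_ = np.asarray(x, float)
        self.y_ = np.asarray(y, float)
        return self

    def predict(self, x0):
        x0 = np.atleast_1d(np.asarray(x0, float))
        n = len(self.x_)
        k = max(3, int(np.ceil(self.frac * n)))
        out = np.empty(len(x0))
        for i, q in enumerate(x0):
            d = np.abs(self.x_ - q)
            h = np.sort(d)[k - 1]  # distance to the k-th nearest neighbour
            w = np.clip(1 - (d / h) ** 3, 0, None) ** 3  # tricube weights
            sw = np.sqrt(w)
            # Weighted least squares for a line centred at q,
            # so the intercept is the prediction at q.
            A = np.column_stack([np.ones(n), self.x_ - q])
            beta, *_ = np.linalg.lstsq(A * sw[:, None], self.y_ * sw, rcond=None)
            out[i] = beta[0]
        return out

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 150)
y = np.sin(x) + rng.normal(size=x.size) / 10
model = SimpleLoess(frac=0.3).fit(x, y)
y0 = model.predict(np.linspace(0, 5, 26))
```

All of the work happens in predict, which is why the functional form y0 = localreg(x, y, x0) loses nothing in performance.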

Check out the loess class in scikit-misc. The fitted object has a predict method:
from skmisc.loess import loess
loess_fit = loess(x, y, span=.01)
loess_fit.fit()
preds = loess_fit.predict(x_new).values
https://has2k1.github.io/scikit-misc/stable/generated/skmisc.loess.loess.html

Related

How to fit any non-linear functions in python?

I have already checked post1, post2, post3 and post4, but they didn't help.
I have data about a specific plant, including two variables called "Age" and "Height". The relationship between them is non-linear.
To fit a model, one solution I assume is as follows:
If the non-linear function is
Height = a + b*Age + c*Age^2
then we can bring in a new variable K where
K = Age^2
so we have changed the first non-linear function into a multiple linear regression. Based on this, I have the following code:
data['K'] = data["Age"].pow(2)
x = data[["Age", "K"]]
y = data["Height"]
model = LinearRegression().fit(x, y)
print(model.score(x, y)) # = 0.9908571840250205
Am I doing this correctly?
How would I do this with cubic and exponential functions?
Thanks.

For cubic polynomials:
data['x2'] = data["Age"].pow(2)
data['x3'] = data["Age"].pow(3)
x = data[["Age", "x2","x3"]]
y = data["Height"]
model = LinearRegression().fit(x, y)
print(model.score(x, y))
You can handle exponential data by fitting log(y),
or find a library that can fit polynomials automatically, e.g. https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
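Both suggestions can be sketched with np.polyfit alone. The data below is synthetic and noise-free, and the coefficient values are made up for illustration; the log trick assumes all y-values are positive.

```python
import numpy as np

x = np.linspace(1, 10, 20)

# Cubic: np.polyfit returns coefficients from highest degree to lowest.
y_cubic = 0.5 * x**3 - 2 * x**2 + x + 4
cubic_coeffs = np.polyfit(x, y_cubic, deg=3)
cubic_fit = np.polyval(cubic_coeffs, x)

# Exponential y = a * exp(b * x): log(y) = log(a) + b * x is linear in x.
y_exp = 2.0 * np.exp(0.3 * x)
b, log_a = np.polyfit(x, np.log(y_exp), deg=1)
a = np.exp(log_a)
print(cubic_coeffs, a, b)
```

Note that fitting log(y) minimizes errors on the log scale, so with noisy data the result can differ from a direct non-linear fit of the exponential.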

Hopefully you don't have a religious fervor for using SKLearn here because the answer I'm going to suggest is going to completely ignore it.
If you're interested in doing regression analysis where you have complete autonomy over the fitting function, I'd suggest cutting directly down to the least-squares optimization algorithm that drives a lot of this type of work, which you can do using scipy:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import leastsq
x, y = np.array([0,1,2,3,4,5]), np.array([0,1,4,9,16,25])
# initial_guess[i] maps to p[i] in function_to_fit, must be reasonable
initial_guess = [1, 1, 1]
def function_to_fit(x, p):
    return pow(p[0]*x, 2) + p[1]*x + p[2]
def residuals(p, y, x):
    return y - function_to_fit(x, p)
cnsts = leastsq(
    residuals,
    initial_guess,
    args=(y, x)
)[0]
fig, ax = plt.subplots()
ax.plot(x, y, 'o')
xi = np.arange(0,10,0.1)
ax.plot(xi, [function_to_fit(x, cnsts) for x in xi])
plt.show()
Now this is a numeric approach to the solution, so I would recommend taking a moment to make sure you understand the limitations of such an approach. For problems like these, though, I've found it more than adequate for fitting functions to non-linear data sets without doing some hand-waving to make them fit inside a linearizable manifold.

How to do a cubic or higher polynomial multiple regression in Python?

I have a set of data where longitude and latitude are the independent variables and temperature is the dependent variable. I want to be able to perform extrapolation to get temperature values outside of the range of the latitude and longitude. The best way I thought to do this was to perform a multiple regression.
I know that sklearn has the functionality to perform a linear multiple regression from their linear_model library.
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit('independent data', 'dependent data')
However, my temperature doesn't seem to have a linear relationship with the latitude or with the longitude. Thus, some of the values I extrapolate seem to be off.
I was thinking that I could perhaps improve the extrapolation by performing a polynomial multiple regression instead of a linear one.
Is there some library out there already that provides this functionality?

Probably the easiest way is to do linear regression but perform some basic 'feature engineering' and make your own polynomial features. You could take a look at PolynomialFeatures which can help construct an array of polynomial features.
As a basic example consider this:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# make example data
x = np.linspace(0, 10, 10)
y = x**2 + np.random.rand(len(x))*10
# make new polynomial feature
x_squared = x**2
# perform LR
LR = LinearRegression()
LR.fit(np.c_[x, x_squared], y) # np.c_ stacks the feature into a 2D array.
# evaluate the model
eval_x = np.linspace(0, 10, 100)
eval_x_squared = eval_x**2
y_pred = LR.predict(np.c_[eval_x, eval_x_squared])
# plot the result
plt.plot(x, y, 'ko')
plt.plot(eval_x, y_pred, 'r-', label='Polynomial fit')
plt.legend()
The resulting figure looks like this:
Of course we had to manually construct our features in this example, but hopefully it shows you how it can be practically implemented.
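For the two-variable case in the question (longitude and latitude), the same idea can be sketched by building all terms up to degree 2 by hand and solving with ordinary least squares. This uses NumPy only rather than PolynomialFeatures; the helper name poly2_features and the synthetic temperature surface are assumptions for illustration.

```python
import numpy as np

def poly2_features(lon, lat):
    """Design matrix with all terms up to degree 2 in two variables."""
    lon, lat = np.asarray(lon, float), np.asarray(lat, float)
    return np.column_stack(
        [np.ones_like(lon), lon, lat, lon**2, lon * lat, lat**2]
    )

# Synthetic, noise-free example data (made up for illustration).
rng = np.random.default_rng(0)
lon = rng.uniform(-5, 5, 200)
lat = rng.uniform(40, 50, 200)
temp = 30 - 0.5 * lat + 0.1 * lon - 0.02 * lon * lat + 0.01 * lon**2

# Ordinary least squares on the polynomial features.
coeffs, *_ = np.linalg.lstsq(poly2_features(lon, lat), temp, rcond=None)

# Evaluate outside the sampled range (extrapolation).
temp_outside = poly2_features([6.0], [52.0]) @ coeffs
print(temp_outside)
```

Polynomial fits can diverge quickly outside the data range, so extrapolated values deserve a sanity check, especially for cubic and higher degrees.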

PYMC3 Bayesian Prediction Cones

I'm still learning PYMC3, but I cannot find anything on the following problem in the docs. Consider the Bayesian Structural Time Series (BSTS) model from this question with no seasonality. This can be modeled in PYMC3 as follows:
import pymc3, numpy, matplotlib.pyplot
# generate some test data
t = numpy.linspace(0,2*numpy.pi,100)
y_full = numpy.cos(5*t)
y_train = y_full[:90]
y_test = y_full[90:]
# specify the model
with pymc3.Model() as model:
    grw = pymc3.GaussianRandomWalk('grw',mu=0,sd=1,shape=y_train.size)
    y = pymc3.Normal('y',mu=grw,sd=1,observed=y_train)
    trace = pymc3.sample(1000)
y_mean_pred = pymc3.sample_ppc(trace,samples=1000,model=model)['y'].mean(axis=0)
fig = matplotlib.pyplot.figure(dpi=100)
ax = fig.add_subplot(111)
ax.plot(t,y_full,c='b')
ax.plot(t[:90],y_mean_pred,c='r')
matplotlib.pyplot.show()
Now I would like to predict the behavior for the next 10 time steps, i.e., y_test. I would also like to include credible regions over this area to produce a Bayesian cone, e.g., see here. Unfortunately, the mechanism for producing the cones in the aforementioned link is a little vague. In a more conventional AR model one could learn the mean regression coefficients and manually extend the mean curve. However, in this BSTS model there is no obvious way to do this. Alternatively, if there were regressors, then I could use a theano.shared and update it with a finer/extended grid to impute and extrapolate with sample_ppc, but that's not really an option in this setting. Perhaps sample_ppc is a red herring here, but it's unclear how else to proceed. Any help would be welcome.

I think the following works. However, it's super clunky and requires that I know how far in advance I want to predict before I train (in particular, it precludes streaming usage or simple EDA). I suspect there is a better way, and I would much rather accept a better solution from someone with more PyMC3 experience.
import numpy, pymc3, matplotlib.pyplot, seaborn
# generate some data
t = numpy.linspace(0,2*numpy.pi,100)
y_full = numpy.cos(5*t)
# mask the data that I want to predict (requires knowledge
# that one might not always have at training time).
cutoff_idx = 80
y_obs = numpy.ma.MaskedArray(y_full,numpy.arange(t.size)>cutoff_idx)
# specify and train the model, used the masked array to supply only
# the observed data
with pymc3.Model() as model:
    grw = pymc3.GaussianRandomWalk('grw',mu=0,sd=1,shape=y_obs.size)
    y = pymc3.Normal('y',mu=grw,sd=1,observed=y_obs)
    trace = pymc3.sample(5000)
y_pred = pymc3.sample_ppc(trace,samples=20000,model=model)['y']
y_pred_mean = y_pred.mean(axis=0)
# compute percentiles (25 and 75 bound the 50% credible region)
dfp = numpy.percentile(y_pred,[2.5,25,50,75,97.5],axis=0)
# plot actual data and summary posterior information
pal = seaborn.color_palette('Purples')
fig = matplotlib.pyplot.figure(dpi=100)
ax = fig.add_subplot(111)
ax.plot(t,y_full,c='g',label='true value',alpha=0.5)
ax.plot(t,y_pred_mean,c=pal[5],label='posterior mean',alpha=0.5)
ax.plot(t,dfp[2,:],alpha=0.75,color=pal[3],label='posterior median')
ax.fill_between(t,dfp[0,:],dfp[4,:],alpha=0.5,color=pal[1],label='CR 95%')
ax.fill_between(t,dfp[1,:],dfp[3,:],alpha=0.4,color=pal[2],label='CR 50%')
ax.axvline(x=t[cutoff_idx],linestyle='--',color='r',alpha=0.25)
ax.legend()
matplotlib.pyplot.show()
This outputs the following which seems like a really bad prediction, but at least the code is supplying out of sample values.
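One way around having to fix the horizon before training is to simulate the random walk forward by hand from posterior samples of the last latent state. The sketch below is NumPy only and does not call PyMC3; last_state is a made-up stand-in for those posterior samples, and the step and observation standard deviations are assumed to be the fixed sd=1 values from the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, horizon = 5000, 10
step_sd, obs_sd = 1.0, 1.0  # assumed: the sd values used in the model above

# Hypothetical stand-in for posterior samples of the last latent state,
# e.g. trace['grw'][:, -1] after fitting on the training data only.
last_state = rng.normal(loc=0.3, scale=0.5, size=n_samples)

# Each future latent state is the previous one plus N(0, step_sd) noise.
steps = rng.normal(scale=step_sd, size=(n_samples, horizon))
latent_paths = last_state[:, None] + np.cumsum(steps, axis=1)
# Add observation noise to get predictive samples of y.
y_future = latent_paths + rng.normal(scale=obs_sd, size=latent_paths.shape)

# Percentiles per future time step give the cone.
lower, median, upper = np.percentile(y_future, [2.5, 50, 97.5], axis=0)
print(upper - lower)
```

Because a random walk has no drift or periodic component, the cone simply widens around the last state, which is consistent with the poor prediction of the cosine seen above.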

sklearn LogisticRegression - plot displays too small coefficient

I am attempting to fit a logistic regression model to sklearn's iris dataset. I get a probability curve that looks too flat, i.e. the coefficient is too small. I would expect a probability over ninety percent for sepal length > 7:
Is this probability curve indeed wrong? If so, what might cause that in my code?
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
import math
from sklearn.linear_model import LogisticRegression
data = datasets.load_iris()
#get relevant data
lengths = data.data[:100, :1]
is_setosa = data.target[:100]
#fit model
lgs = LogisticRegression()
lgs.fit(lengths, is_setosa)
m = lgs.coef_[0,0]
b = lgs.intercept_[0]
#generate values for curve overlay
lgs_curve = lambda x: 1/(1 + math.e**(-(m*x+b)))
x_values = np.linspace(2, 10, 100)
y_values = lgs_curve(x_values)
#plot it
plt.plot(x_values, y_values)
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")

If you refer to http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression, you will find a regularization parameter C that can be passed as argument while training the logistic regression model.
C : float, default: 1.0 Inverse of regularization strength; must be a
positive float. Like in support vector machines, smaller values
specify stronger regularization.
Now, if you try different values of this regularization parameter, you will find that larger values of C lead to fitted curves with sharper transitions from 0 to 1 in the binary output (response) variable. Still larger values fit models with high variance: they try to follow the transition in the training data more closely. I think that is what you are expecting, so you could try setting C as high as 10 and plotting the result. Such models are more likely to overfit, though, while the default C=1 and smaller values lead to high bias and are likely to underfit. This is the well-known bias-variance trade-off in machine learning.
You can always use techniques like cross-validation to choose the C value that is right for you. The following code / figure shows the probability curve fitted with models of different complexity (i.e., with different values of the regularization parameter C, from 1 to 10):
x_values = np.linspace(2, 10, 100)
x_test = np.reshape(x_values, (100,1))
C = list(range(1, 11))
labels = list(map(str, C))  # a list, so that labels[i] works in Python 3
for i in range(len(C)):
    lgs = LogisticRegression(C = C[i]) # pass a value for the regularization parameter C
    lgs.fit(lengths, is_setosa)
    y_values = lgs.predict_proba(x_test)[:,1] # use this function to compute probability directly
    plt.plot(x_values, y_values, label=labels[i])
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")
plt.legend()
plt.show()
Predicted probs with models fitted with different values of C
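To see why C changes the steepness, it can help to write the penalized objective out by hand. The sketch below minimizes an L2-penalized log-loss with scipy.optimize.minimize on a small made-up data set. It is meant to mirror the form of the objective described in the scikit-learn docs (penalty plus C times the log-loss, intercept left unpenalized), which is an assumption here, and it is not scikit-learn's solver.

```python
import numpy as np
from scipy.optimize import minimize

# Small made-up 1-D data set with overlapping classes.
x = np.array([4.5, 5.0, 5.2, 5.5, 5.8, 6.0, 6.3, 6.8])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
xc = x - x.mean()   # centring keeps the optimization well conditioned
sign = 2 * y - 1    # labels recoded as -1/+1

def objective(params, C):
    m, b = params
    # sum of log(1 + exp(-sign * (m*x + b))), computed stably
    log_loss = np.logaddexp(0, -sign * (m * xc + b)).sum()
    return 0.5 * m**2 + C * log_loss  # L2 penalty on the slope only

slopes = {}
for C in (0.1, 1, 10, 100):
    res = minimize(objective, x0=[0.0, 0.0], args=(C,))
    slopes[C] = res.x[0]
print(slopes)
```

The fitted slope grows with C because a larger C makes the data term dominate the penalty that pulls the slope towards zero.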

Although you do not describe what you want to plot, I assume you want to plot the separating line. It seems that you are confused with respect to the Logistic/sigmoid function. The decision function of Logistic Regression is a line.

Your probability graph looks flat because you have, in a sense, "zoomed in" too much.
If you look at the middle of a sigmoid function, it gets to be almost linear, as the second derivative gets to be almost 0 (see for example a wolfram alpha graph).
Please note that the values we are talking about are the results of -(m*x+b).
When we reduce the limits of your graph, say by using
x_values = np.linspace(4, 7, 100), we get something which looks like a line:
But on the other hand, if we go crazy with the limits, say by using x_values = np.linspace(-10, 20, 100), we get the clearer sigmoid:
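The "zoomed in" effect can be checked numerically by comparing the sigmoid with its tangent line at the midpoint. This is a small sketch in terms of z = m*x+b; the two ranges are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Narrow window around the midpoint of the sigmoid.
z_narrow = np.linspace(-0.5, 0.5, 101)
# Tangent line of the sigmoid at z = 0: value 0.5, slope 0.25.
max_gap_narrow = np.max(np.abs(sigmoid(z_narrow) - (0.5 + z_narrow / 4)))

# Wide window: the curve saturates while the tangent keeps growing.
z_wide = np.linspace(-10, 10, 101)
max_gap_wide = np.max(np.abs(sigmoid(z_wide) - (0.5 + z_wide / 4)))
print(max_gap_narrow, max_gap_wide)
```

In the narrow window the curve stays within about 0.003 of a straight line, while over the wide window the difference is large because the sigmoid flattens out at 0 and 1.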

Scipy Fmin Gaussian model to real data

I've been trying to solve this for a bit and really just haven't seen an example or anything that my brain is able to use to move forward.
The goal is to find a model Gaussian curve by minimizing the total chi-squared between the real data and the model resulting from unknown parameters that require sensible estimations (the Gaussian is of unknown position, amplitude and width). scipy.optimize.fmin has come up but I've never used this before and I'm still very new to python...
Ultimately, I'd like to plot the original data along with the model. I have used pyplot before; it's just generating the model and using fmin that has me completely bewildered. This is essentially where I am:
def gaussian(a, b, c, x):
    return a*np.exp(-(x-b)**2/(2*c**2))
I've seen multiple ways to generate a model and this has rendered me confused and thus I have no code! I have imported my data file through np.loadtxt.
Thanks for anyone that can suggest a framework or help at all.

There are basically four (or five) main steps involved in model fitting problems like this:
1. Define your forward model, yhat = F(P, x), that takes a set of parameters P and your independent variable x, and estimates your response variable y
2. Define your loss function, loss = L(P, x, y), that you'd like to minimize over your parameters
3. Optional: define a function that returns the Jacobian matrix, i.e. the partial derivatives of your loss function w.r.t. your model parameters.*
4. Make an initial guess at your model parameters
5. Plug all these into one of the optimizers and get the fitted parameters for your model
Here's a worked example to get you started:
import numpy as np
from scipy.optimize import minimize
from matplotlib import pyplot as pp
# function that defines the model we're fitting
def gaussian(P, x):
    a, b, c = P
    return a*np.exp(-(x-b)**2 /( 2*c**2))
# objective function to minimize
def loss(P, x, y):
    yhat = gaussian(P, x)
    return ((y - yhat)**2).sum()
# generate a gaussian distribution with known parameters
amp = 1.3543
pos = 64.546
var = 12.234
P_real = np.array([amp, pos, var])
# we use the vector of real parameters to generate our fake data
x = np.arange(100)
y = gaussian(P_real, x)
# add some gaussian noise to make things harder
y_noisy = y + np.random.randn(y.size)*0.5
# minimize needs an initial guess at the model parameters
P_guess = np.array([1, 50, 25])
# minimize provides a unified interface to all of scipy's solvers. you
# can also access them individually in scipy.optimize, but the
# standalone versions have annoying differences in their syntax. for now
# we'll use the Nelder-Mead solver, which doesn't use the Jacobian. we
# also need to hand it x and y_noisy as additional args to loss()
res = minimize(loss, P_guess, method='Nelder-Mead', args=(x, y_noisy))
# res is a dict containing the results of the optimization. in particular we
# want the optimized model parameters:
P_fit = res['x']
# we can pass these to gaussian() to evaluate our fitted model
y_fit = gaussian(P_fit, x)
# now let's plot the results:
fig, ax = pp.subplots(1,1)
ax.plot(x, y, '-r', lw=2, label='Real')
ax.plot(x, y_noisy, '-k', alpha=0.5, label='Noisy')
ax.plot(x, y_fit, '--b', lw=5, label='Fit')
ax.legend(loc=0, fancybox=True)
pp.show()
*Some solvers, e.g. conjugate gradient methods, take the Jacobian as an additional argument, and by and large these solvers are faster and more robust, but if you're feeling lazy and performance isn't all that critical then you can usually get away without providing the Jacobian, in which case it will use the finite differences method to estimate the gradients.
You can read more about the different solvers here
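Since the question asks specifically about minimizing chi-squared, scipy.optimize.curve_fit is another option worth knowing: when given per-point uncertainties via sigma, it minimizes the sum of squared residuals divided by sigma. The sketch below reuses the parameter values from the worked example; the noise level sigma = 0.05 is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, b, c):
    # curve_fit expects the independent variable first, then the parameters
    return a * np.exp(-(x - b)**2 / (2 * c**2))

rng = np.random.default_rng(0)
x = np.arange(100)
sigma = 0.05  # assumed known measurement uncertainty per point
y_noisy = gaussian(x, 1.3543, 64.546, 12.234) + rng.normal(scale=sigma, size=x.size)

# Initial guess for amplitude, position and width.
p_guess = [1, 50, 25]
p_fit, p_cov = curve_fit(gaussian, x, y_noisy, p0=p_guess,
                         sigma=np.full(x.size, sigma), absolute_sigma=True)
p_err = np.sqrt(np.diag(p_cov))  # 1-sigma parameter uncertainties
print(p_fit, p_err)
```

Compared with the minimize approach, curve_fit also returns a covariance estimate for the parameters, at the cost of being restricted to least-squares objectives.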
