Generalised additive model - Python

I'm trying to fit a non-linear model using a Generalized Additive Model. How do I determine the number of splines to use? Is there a specific way to choose the number of splines? I have used 3rd-order (cubic) splines. Below is the code.
import matplotlib.pyplot as plt
from pygam import LinearGAM
from pygam.utils import generate_X_grid

# Curve fitting using GAM model - Penalised spline curve.
def modeltrain(time, value):
    return LinearGAM(n_splines=58, spline_order=3).gridsearch(time, value)

# t1, x1 are the time and value arrays shown in the scatter plot (assumed names)
model = modeltrain(t1, x1)
# samples random x-values for prediction
XX = generate_X_grid(model)
# plots for visualisation
plt.plot(XX, model.predict(XX), 'r--')
plt.plot(XX, model.prediction_intervals(XX, width=0.25), color='b', ls='--')
plt.scatter(t1, x1)
This is the expected result
Original data scatter plot
If the number of splines is not chosen correctly, then I get an incorrect fit.
I would appreciate suggestions for methods to choose the number of splines accurately.

Typically for splines you choose a fairly high number of splines (~25) and you let the lambda smoothing parameter do the work of reducing the flexibility of the model.
For your use-case I would choose the default n_splines=25 and then do a gridsearch over the lambda parameter lam to find the best amount of smoothing:
import numpy as np

def modeltrain(time, value):
    return LinearGAM(n_splines=25, spline_order=3).gridsearch(time, value, lam=np.logspace(-3, 3, 11))
This will try 11 models from lam = 1e-3 to 1e3.
I think your choice of n_splines=58 is too high because it looks like it produces one spline per data-point.
If you really want to do a search over n_splines then you could do:
LinearGAM(n_splines=25, spline_order=3).gridsearch(time, value, n_splines=np.arange(4, 50))  # n_splines must be larger than spline_order
Note: the function generate_X_grid does NOT do random sampling for prediction, it actually just makes a dense linear-spacing of your X-values (time). The reason for this is to visualize how the learned model will interpolate.
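To make the "many splines, let lam control the smoothness" idea concrete, here is a minimal NumPy sketch of penalised regression with a grid search over the smoothing parameter. It is not pyGAM's implementation: the piecewise-linear hat basis, the second-difference penalty, the synthetic data and the helper names (hat_basis, fit_penalised) are all assumptions made for illustration, and generalised cross-validation (GCV) is used as the selection score.

```python
import numpy as np

def hat_basis(x, n_basis):
    """Piecewise-linear (degree-1 B-spline) basis on equally spaced knots."""
    knots = np.linspace(x.min(), x.max(), n_basis)
    step = knots[1] - knots[0]
    return np.clip(1.0 - np.abs(x[:, None] - knots[None, :]) / step, 0.0, None)

def fit_penalised(x, y, n_basis, lam):
    """Penalised least squares; returns coefficients, effective dof and GCV score."""
    B = hat_basis(x, n_basis)
    D = np.diff(np.eye(n_basis), n=2, axis=0)      # second differences of coefficients
    A = B.T @ B + lam * (D.T @ D)
    coef = np.linalg.solve(A, B.T @ y)
    edof = np.trace(np.linalg.solve(A, B.T @ B))   # trace of the smoother matrix
    rss = np.sum((y - B @ coef) ** 2)
    n = len(y)
    gcv = n * rss / (n - edof) ** 2
    return coef, edof, gcv

# Synthetic data standing in for (time, value)
rng = np.random.default_rng(0)
time = np.linspace(0.0, 10.0, 200)
value = np.sin(time) + rng.normal(scale=0.2, size=time.size)

lams = np.logspace(-3, 3, 11)
results = [fit_penalised(time, value, n_basis=25, lam=lam) for lam in lams]
edofs = np.array([r[1] for r in results])
gcvs = np.array([r[2] for r in results])
best_lam = lams[np.argmin(gcvs)]
print("effective dof per lam:", np.round(edofs, 1))
print("lam with lowest GCV:", best_lam)
```

The number of basis functions stays fixed at 25 throughout; only lam changes, and the effective degrees of freedom shrink as lam grows. That is the sense in which the penalty, rather than n_splines, sets the flexibility.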


Fitting a model with some known parameters to an experimental dataset in python, in order to optimise other parameters

I have an experimental dataset which plots intensity as a function of energy. These are arrays of 1800 datapoints.
I have been trying to fit a model to this data, given by the equation below:
Imodel = I0 * ((math.cos(phi) + (beta * f1))**2 + (math.sin(phi) + (beta*f2))**2) + Ioff
I have 2 other datasets of f1 vs. energy and f2 vs. energy. These are arrays of 700 datapoints, albeit over the same energy range as the first dataset.
I want to use this model function together with the f1 and f2 data to find optimal values of the other 4 parameters (I0, phi, beta, Ioff) for which this model function fits the experimental dataset as closely as possible.
I have been looking into curve_fit and least_squares from the scipy.optimize package, as well as packages such as lmfit and scikit-learn, but to no avail.
Can anyone help? Thanks
At present I have no representative data from Ayrtonb1 with which to test the method proposed below. The method seems sound on theoretical grounds, but one cannot be sure that it will be satisfactory with the OP's data.
Nevertheless, a preliminary test was carried out with "toy" data (shown below).
I suppose that the screenshot below is sufficient to understand the method and to reproduce the calculation with real data.
The result of this preliminary test is rather good:
LRMSE < 2 for a range up to 600 (LRMSE: Least Root Mean Square Error).
LRMSRE < 2% (LRMSRE: Least Root Mean Square Relative Error).
Your data and formula look suspiciously like resonant (or anomalous) X-ray diffraction data, with measurements of scattered intensity going across the Zn K-edge. Although you do not say this, the discussion here will assume that. You say you have 1800 measurements, presumably as a function of X-ray energy in eV. The resonant scattering factors (f1, f2) you show seem to be idealized and possibly "typical", and perhaps not specifically for the Zn K-edge -- at the very least the energy scale shown is not the same as your data.
You will want to treat the data and model the intensity as a function of X-ray energy. And you will want realistic values for f1 and f2 for the element of interest, and at the actual energy points for your data. I recommend using xraydb (full disclosure: I am the lead author) [pip install xraydb], so that you can do
import numpy as np
import xraydb
#edata, idata = function_to_extract_your_data()
# or perhaps testing with
edata = np.linspace(9500, 10500, 501)
f1 = xraydb.f1_chantler('Zn', edata)
f2 = xraydb.f2_chantler('Zn', edata)
As written, your intensity function does not directly depend on energy, though it might at a later date, say to make that offset be linear in energy, not just a constant. You might write a function like:
def intensity(en, phi, beta, scale=1, slope=0, offset=0, f1=-1, f2=1):
    costerm = np.cos(phi) + beta*f1
    sinterm = np.sin(phi) + beta*f2
    return scale * (costerm**2 + sinterm**2) + slope*en + offset
With that you can start just trying out some values to get a feel for the function and how it compares to your data:
import matplotlib.pyplot as plt
beta = 0.025 # Wild Guess!
for phi in np.pi*np.arange(20)/10:
    plt.plot(edata, intensity(edata, phi, beta, f1=f1, f2=f2), label='%.1f'%phi)
It kind of looks like your value for phi would be around 5.5 to 6 (or -0.8 to -0.3). You could also try different values of beta and plot that with your actual data.
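Expanding the squares in the intensity function helps to read these numbers. This is plain algebra on the function above (with slope = 0), not something stated in the question:

```latex
\begin{aligned}
I &= \text{scale}\,\bigl[(\cos\phi + \beta f_1)^2 + (\sin\phi + \beta f_2)^2\bigr] + \text{offset} \\
  &= \text{scale}\,\bigl[1 + 2\beta\,(f_1\cos\phi + f_2\sin\phi) + \beta^2\,(f_1^2 + f_2^2)\bigr] + \text{offset}
\end{aligned}
```

Because phi enters only through its cosine and sine, phi and phi + 2π give identical curves, which is why 5.5 to 6 and -0.8 to -0.3 describe the same range. For small beta, the energy dependence comes mostly from the term linear in beta.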
With that model for intensity and a feel for what the range of parameters is, you could then try to fit your data. To do that, I would recommend using lmfit (full disclosure: I am the lead author) [pip install lmfit], where you can create a model from your intensity model function - this will use the names of the function arguments to name the fitting parameters.
from lmfit import Model
imodel = Model(intensity, independent_vars=['en', 'f1', 'f2'])
params = imodel.make_params(scale=1, offset=0, slope=0, beta=0.1, phi=5.5)
That independent_vars will tell Model to not make fitting Parameters for the function arguments f1 and f2 and to expect them to be passed into any evaluation or fit. The other function arguments (scale, phi, etc) will be expected to become fitting variables. You do have to create a "Parameters" object and must give initial values for all parameters. You can put bounds on these or fix them so they are not altered in the fit, as with
params['scale'].min = 0 # force scale to be positive
params['slope'].vary = False # slope will be fixed at 0.
You can then evaluate the model with
init_value = imodel.eval(params, en=edata, f1=f1, f2=f2)
And then eventually do a fit with
result = imodel.fit(idata, params, en=edata, f1=f1, f2=f2)
plt.plot(edata, idata, label='data')
plt.plot(edata, init_value, label='initial fit')
plt.plot(edata, result.best_fit, label='best fit')
Finally, for analysis of X-ray resonant scattering be sure to consider including absorption corrections in that intensity calculation. As you go across the Zn K edge, the absorption depth of the sample may change dramatically if the Zn concentration is high.
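The question mentions tabulated f1 and f2 on 700 points while the intensity has 1800 points. Below is a minimal NumPy sketch of one way to handle that, assuming all arrays cover the same energy range. It first resamples the tables onto the measurement grid with np.interp. It then uses the fact that, with the slope fixed at 0, the model is linear in four combined coefficients, so ordinary least squares can supply starting values for the nonlinear fit. All data here are synthetic, the names e_tab, f1_tab and f2_tab are hypothetical, and positive scale and beta are assumed when converting back.

```python
import numpy as np

# Hypothetical tabulated scattering factors on a coarse 700-point grid
e_tab = np.linspace(9500.0, 10500.0, 700)
f1_tab = -2.0 - 4.0 * np.exp(-((e_tab - 9659.0) / 60.0) ** 2)
f2_tab = 0.5 + 3.4 / (1.0 + np.exp(-(e_tab - 9659.0) / 8.0))

# Energy grid of the (here synthetic) 1800-point intensity measurement
edata = np.linspace(9500.0, 10500.0, 1800)
f1 = np.interp(edata, e_tab, f1_tab)   # linear resampling onto edata
f2 = np.interp(edata, e_tab, f2_tab)

# Synthetic "measurement" from known parameters, so the recovery can be checked
true = dict(scale=100.0, phi=5.8, beta=0.03, offset=5.0)
idata = true["scale"] * ((np.cos(true["phi"]) + true["beta"] * f1) ** 2
                         + (np.sin(true["phi"]) + true["beta"] * f2) ** 2) + true["offset"]

# I = c0 + c1*f1 + c2*f2 + c3*(f1^2 + f2^2), with
#   c0 = scale + offset,            c1 = 2*scale*beta*cos(phi),
#   c2 = 2*scale*beta*sin(phi),     c3 = scale*beta^2
A = np.column_stack([np.ones_like(edata), f1, f2, f1 ** 2 + f2 ** 2])
c0, c1, c2, c3 = np.linalg.lstsq(A, idata, rcond=None)[0]

scale = (c1 ** 2 + c2 ** 2) / (4.0 * c3)
beta = np.sqrt(c3 / scale)
phi = np.arctan2(c2, c1) % (2.0 * np.pi)
offset = c0 - scale
print(scale, phi, beta, offset)
```

With real, noisy data the small quadratic coefficient may be poorly determined, so these values are probably best treated as initial guesses for make_params rather than as final results.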

Is there a way to get the probability of a prediction using XGBoostRegressor?

I have built an XGBoostRegressor model using around 200 categorical features to predict a continuous time variable.
But I would like to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I want both the prediction and P(Y|X) as output. Any idea how to do this?
There is no probability in regression. A regressor outputs only a predicted value, so it does not give the probability of a prediction; predicted probabilities are available only in classification.
As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note though, is that the variance might not be the same along the data.
Let's assume that you study a time based phenomenon. Specifically, you have the temperature (y) after (x) time (in sec for instance) inside an oven. At x = 0s it is at 20°C, and you start heating it, and want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or be completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you took care of heteroscedasticity, so your interval is the same for all the data.
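One simple way to attach an interval to any point regressor is to use the spread of its errors on held-out data. The sketch below is not an XGBoost feature: it uses hypothetical hold-out values and a split-conformal style quantile of the absolute residuals. It yields one constant half-width for all inputs, so it relies on the equal-variance assumption discussed above.

```python
import math

def residual_half_width(y_true, y_pred, coverage=0.9):
    """Half-width q such that roughly `coverage` of held-out errors satisfy |error| <= q."""
    abs_res = sorted(abs(t - p) for t, p in zip(y_true, y_pred))
    n = len(abs_res)
    rank = min(n, math.ceil((n + 1) * coverage))   # conformal-style finite-sample rank
    return abs_res[rank - 1]

# Hypothetical hold-out targets and the model's predictions for them
y_holdout = [10.0, 12.5, 9.0, 14.0, 11.0, 13.5, 8.5, 12.0, 10.5, 15.0]
y_hat     = [10.4, 12.0, 9.9, 13.2, 11.1, 13.0, 9.0, 12.3, 10.0, 13.0]

q = residual_half_width(y_holdout, y_hat, coverage=0.8)
new_prediction = 11.7                      # e.g. the output of model.predict for a new row
interval = (new_prediction - q, new_prediction + q)
print(q, interval)
```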
You can probably try to get the distribution of your known outputs, compare the prediction against that curve, and check the p-value. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
This is how I would do it. Obviously the outputs are your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d
N = 1000 # The number of sample
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)
# We want to get a normed histogram (since this is PDF, if we integrate
# it must be equal to 1)
nbins = N / 10
n = int(N / nbins)
p, x = np.histogram(outputs, bins=n, density=True)  # 'normed' was removed; 'density' replaces it
plt.hist(outputs, bins=n, density=True)
x = x[:-1] + (x[1] - x[0])/2 # converting bin edges to centers
# Now we want to interpolate :
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9*std, 2.9*std, 10000)
plt.plot(x, f(x))
# To check :
area = integrate.quad(f, x[0], x[-1])
print(area) # (should be close to 1)
Now, the interpolation method is not great for outliers: if a predicted value is extremely far (more than 3 times the std) from your distribution, it won't work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in that time. I'm sure there are better ways to do it. If your data follow a normal distribution, it becomes trivial.
I suggest you look into NGBoost (natural gradient boosting), a gradient-boosting library that yields a probabilistic model rather than a single point prediction.
Here you can find slides on how NGBoost works and the seminal NGBoost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (Gaussian by default) and fit a gradient-boosted model to estimate the parameters of that distribution (for the Gaussian, $\mu$ and $\sigma$). The model will split the variables' space into different regions with different distributions, i.e. same family (e.g. Gaussian) but different parameters.
After training the model, the method pred_dist returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$.
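Once a model returns a distribution rather than a point, the "probability of a prediction" can be read off as the probability of an interval. Here is a minimal sketch using only the standard library. It assumes a Gaussian predictive distribution whose mean and standard deviation (hypothetical values here) were produced by such a model for one input x:

```python
from statistics import NormalDist

# Hypothetical predictive parameters for one input x
mu, sigma = 10.0, 2.0
pred = NormalDist(mu=mu, sigma=sigma)

# A continuous variable has zero probability at any single value,
# so ask for the probability of a range instead.
p_within = pred.cdf(12.0) - pred.cdf(8.0)                  # P(8 < Y < 12 | X = x)
lower, upper = pred.inv_cdf(0.025), pred.inv_cdf(0.975)    # central 95% interval
print(round(p_within, 4), round(lower, 2), round(upper, 2))
```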

Understanding Partial Dependence for Gradient Boosted Regression trees

I am looking at the tutorial for partial dependence plots in Python. No equation is given in the tutorial or in the documentation. The documentation of the R function gives the formula I expected:
This does not seem to make sense with the results given in the Python tutorial. If it is an average of the prediction of house prices, then how is it negative and small? I would expect values in the millions. Am I missing something?
For regression it seems the average is subtracted off of the above formula. How would this be added back? For my trained model I can get the partial dependence by
from sklearn.ensemble.partial_dependence import partial_dependence
partial_dependence, independent_value = partial_dependence(model, features.index(independent_feature),X=df2[features])
I want to add (?) back on the average. Would I get this by just using model.predict() on the df2 values with the independent_feature values changed?
how the R formula works
The R formula presented in the question applies to a randomForest. Each tree in a random forest tries to predict the target variable directly. Thus, the prediction of each tree lies in the expected interval (in your case, all house prices are positive), and the prediction of the ensemble is just the average of all the individual predictions.
ensemble_prediction = mean(tree_predictions)
This is what the formula tells you: just take the predictions of all the trees for x and average them.
why the Python PDP values are small
In sklearn, however, partial dependence is calculated for a GradientBoostingRegressor. In gradient boosting, each tree predicts the derivative of the loss function at current prediction, which is only indirectly related to the target variable. For GB regression, prediction is given as
ensemble_prediction = initial_prediction + sum(tree_predictions * learning_rate)
and for GB classification predicted probability is
ensemble_prediction = softmax(initial_prediction + sum(tree_predictions * learning_rate))
For both cases, partial dependency is reported as just
sum(tree_predictions * learning_rate)
Thus, initial_prediction (for GradientBoostingRegressor(loss='ls') it is just the mean of the training y) is not included in the PDP, which is why the values can be negative.
As for the small range of its values, the y_train in your example is small: the mean house value is roughly 2, so house prices are probably expressed in millions.
how the sklearn formula actually works
I have already said that in sklearn the value of partial dependence function is an average of all trees. There is one more tweak: all irrelevant features are averaged away. To describe the actual way of averaging, I will just quote the documentation of sklearn:
For each value of the ‘target’ features in the grid the partial
dependence function need to marginalize the predictions of a tree over
all possible values of the ‘complement’ features. In decision trees
this function can be evaluated efficiently without reference to the
training data. For each grid point a weighted tree traversal is
performed: if a split node involves a ‘target’ feature, the
corresponding left or right branch is followed, otherwise both
branches are followed, each branch is weighted by the fraction of
training samples that entered that branch. Finally, the partial
dependence is given by a weighted average of all visited leaves. For
tree ensembles the results of each individual tree are again averaged.
And if you are still not satisfied, see the source code.
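The weighted traversal described in the quote can be sketched on a tiny hand-built tree. This is an illustration only, not scikit-learn's code: the dictionary layout and the helper names (leaf, split, tree_partial_dependence) are assumptions, and the sample counts are made up.

```python
def leaf(value, n):
    """Leaf node holding a prediction and the number of training samples in it."""
    return {"value": value, "n": n}

def split(feature, threshold, left, right):
    """Internal node; samples with x[feature] <= threshold go left."""
    return {"feature": feature, "threshold": threshold,
            "left": left, "right": right, "n": left["n"] + right["n"]}

def tree_partial_dependence(node, target_values):
    """Weighted traversal; target_values maps feature index -> grid value."""
    if "value" in node:                       # reached a leaf
        return node["value"]
    left, right = node["left"], node["right"]
    if node["feature"] in target_values:      # split on a target feature: follow one branch
        go_left = target_values[node["feature"]] <= node["threshold"]
        return tree_partial_dependence(left if go_left else right, target_values)
    # split on a complement feature: follow both, weighted by training-sample fractions
    w_left = left["n"] / node["n"]
    return (w_left * tree_partial_dependence(left, target_values)
            + (1.0 - w_left) * tree_partial_dependence(right, target_values))

tree = split(0, 0.5,
             split(1, 2.0, leaf(-3.0, 20), leaf(1.0, 40)),
             leaf(5.0, 40))

pd_x0_low = tree_partial_dependence(tree, {0: 0.0})    # feature 0 fixed at 0.0
pd_x0_high = tree_partial_dependence(tree, {0: 1.0})   # feature 0 fixed at 1.0
pd_x1_low = tree_partial_dependence(tree, {1: 1.0})    # feature 1 fixed at 1.0
print(pd_x0_low, pd_x0_high, pd_x1_low)
```

No training rows are needed at evaluation time, only the per-node sample counts stored in the tree, which is the efficiency the quoted passage refers to.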
an example
To see that the prediction is already on the scale of the dependent variable (but is just centered), you can look at a very toy example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble.partial_dependence import plot_partial_dependence
X = np.random.normal(size=[1000, 2])
# yes, I will try to fit a linear function!
y = X[:, 0] * 10 + 50 + np.random.normal(size=1000, scale=5)
# mean target is 50, range is roughly from 20 to 80, that is about +/- 3 standard deviations
model = GradientBoostingRegressor().fit(X, y)
fig, subplots = plot_partial_dependence(model, X, [0, 1], percentiles=(0.0, 1.0), n_cols=2)
subplots[0].scatter(X[:, 0], y - y.mean(), s=0.3)
subplots[1].scatter(X[:, 1], y - y.mean(), s=0.3)
plt.suptitle('Partial dependence plots and scatters of centered target')
You can see that partial dependence plots reflect the true distribution of the centered target variable pretty well.
If you want not only the units, but the mean to coincide with your y, you have to add the "lost" mean to the result of the partial_dependence function and then plot the results manually:
from sklearn.ensemble.partial_dependence import partial_dependence
pdp_y, [pdp_x] = partial_dependence(model, X=X, target_variables=[0], percentiles=(0.0, 1.0))
plt.scatter(X[:, 0], y, s=0.3)
plt.plot(pdp_x, pdp_y.ravel() + model.init_.mean)
plt.title('Partial dependence plot in the original coordinates');
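On the question of whether changing the feature values and calling model.predict() would work: that is the brute-force definition of partial dependence, and it comes out on the original scale of y with no mean to add back. A minimal sketch follows, assuming only that the model exposes a predict(X) callable. The toy_predict function and the data are stand-ins, not the fitted model from above.

```python
import numpy as np

def brute_force_pd(predict, X, feature, grid):
    """Average prediction over the data with one feature forced to each grid value."""
    X = np.asarray(X, dtype=float)
    averages = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v          # overwrite the feature of interest everywhere
        averages.append(predict(X_mod).mean())
    return np.array(averages)

def toy_predict(X):
    """Stand-in for model.predict: depends linearly on feature 0."""
    return 50.0 + 10.0 * X[:, 0] + X[:, 1] ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
grid = np.linspace(-2.0, 2.0, 5)
pd_values = brute_force_pd(toy_predict, X, feature=0, grid=grid)
print(pd_values)
```

With a fitted estimator, model.predict can be passed in place of toy_predict. The cost is one full pass over the data per grid value, which is what the tree-traversal shortcut avoids.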
You are looking at a Partial Dependence Plot. A PDP is a graph that represents a set of variables/predictors and their effect on the target field (in this case price). Those graphs do not estimate actual prices.
It is important to realize that a PDP is not a representation of the dataset values or price. It is a representation of the variables' effect on the target field. For a classification model, the negative numbers are logits of probabilities, not raw probabilities.

Predicting on new data using locally weighted regression (LOESS/LOWESS)

How to fit a locally weighted regression in python so that it can be used to predict on new data?
There is statsmodels.nonparametric.smoothers_lowess.lowess, but it returns the estimates only for the original data set; so it seems to only do fit and predict together, rather than separately as I expected.
scikit-learn always has a fit method that allows the object to be used later on new data with predict; but it doesn't implement lowess.
Lowess works great for predicting (when combined with interpolation)! I think the code is pretty straightforward-- let me know if you have any questions!
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.interpolate import interp1d
import statsmodels.api as sm
# introduce some floats in our x-values
x = list(range(3, 33)) + [3.2, 6.2]
y = [1,2,1,2,1,1,3,4,5,4,5,6,5,6,7,8,9,10,11,11,12,11,11,10,12,11,11,10,9,8,2,13]
# lowess will return our "smoothed" data with a y value for every x-value
lowess = sm.nonparametric.lowess(y, x, frac=.3)
# unpack the lowess smoothed points to their values
lowess_x = list(zip(*lowess))[0]
lowess_y = list(zip(*lowess))[1]
# run scipy's interpolation. There is also extrapolation I believe
f = interp1d(lowess_x, lowess_y, bounds_error=False)
xnew = [i/10. for i in range(400)]
# this will generate y values for our x-values using the interpolator
# it will MISS values outside of the x window (less than 3, greater than 32)
# There might be a better approach, but you can run a for loop
# and if the value is out of the range, use f(min(lowess_x)) or f(max(lowess_x))
ynew = f(xnew)
plt.plot(x, y, 'o')
plt.plot(lowess_x, lowess_y, '*')
plt.plot(xnew, ynew, '-')
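For the out-of-range points mentioned in the comments above, one alternative to a for loop is numpy.interp, which by default returns the first or last y value for inputs outside the x range. Here is a small sketch with stand-in arrays; in practice the sorted lowess_x and lowess_y from above would be passed instead.

```python
import numpy as np

# Stand-ins for the sorted lowess_x / lowess_y values
lowess_x = np.array([3.0, 10.0, 20.0, 32.0])
lowess_y = np.array([1.2, 4.0, 9.5, 6.0])

xnew = np.array([0.0, 3.0, 15.0, 32.0, 40.0])
# x-coordinates must be increasing; outside them the end values are held constant
ynew = np.interp(xnew, lowess_x, lowess_y)
print(ynew)
```

Holding the end values constant is a choice, not a forecast. It simply avoids NaN for points just outside the fitted window.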
I've created a module called moepy that provides an sklearn-like API for a LOWESS model (incl. fit/predict). This enables predictions to be made using the underlying local regression models, rather than the interpolation method described in the other answers. A minimalist example is shown below.
# Imports
import numpy as np
import matplotlib.pyplot as plt
from moepy import lowess
# Data generation
x = np.linspace(0, 5, num=150)
y = np.sin(x) + (np.random.normal(size=len(x)))/10
# Model fitting
lowess_model = lowess.Lowess()
lowess_model.fit(x, y)
# Model prediction
x_pred = np.linspace(0, 5, 26)
y_pred = lowess_model.predict(x_pred)
# Plotting
plt.plot(x_pred, y_pred, '--', label='LOWESS', color='k', zorder=3)
plt.scatter(x, y, label='Noisy Sin Wave', color='C1', s=5, zorder=1)
A more detailed guide on how to use the model (as well as its confidence and prediction interval variants) can be found here.
Consider using kernel regression instead.
statsmodels has an implementation.
If you have too many data points, why not use scikit-learn's RadiusNeighborsRegressor and specify a tricube weighting function?
It's not clear whether it's a good idea to have a dedicated LOESS object with separate fit/predict methods like what is commonly found in Scikit-Learn. By contrast, for neural networks, you could have an object which stores only a relatively small set of weights. The fit method would then optimize the "few" weights by using a very large training dataset. The predict method only needs the weights to make new predictions, and not the entire training set.
Predictions based on LOESS and nearest neighbors, on the other hand, need the entire training set to make new predictions. The only thing a fit method could do is store the training set in the object for later use. If x and y are the training data, and x0 are the points at which to make new predictions, this object-oriented fit/predict solution would look something like the following:
model = Loess()
model.fit(x, y) # No calculations. Just store x and y in model.
y0 = model.predict(x0) # Uses x and y just stored.
By comparison, in my localreg library, I opted for simplicity:
y0 = localreg(x, y, x0)
It really comes down to design choices, as the performance would be the same.
One advantage of the fit/predict approach is that you could have a unified interface like they do in Scikit-Learn, where one model could easily be swapped for another. The fit/predict approach also encourages a machine learning way to think of it, but in that sense LOESS is not very efficient, since it requires storing and using all the data for every new prediction. The latter approach leans more towards the origins of LOESS as a scatterplot smoothing algorithm, which is how I prefer to think about it. This might also shed some light on why statsmodels does it the way it does.
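To illustrate the point that fit has nothing to compute, here is a minimal sketch of such a Loess class: local linear regression with tricube weights, written with NumPy. It is not the code of localreg, moepy or statsmodels. The class, its frac parameter and the nearest-neighbour bandwidth rule are assumptions made for illustration, and there are no robustness iterations and no handling of many duplicate x values.

```python
import numpy as np

class Loess:
    """Local linear regression with tricube weights (illustrative only)."""

    def __init__(self, frac=0.3):
        self.frac = frac                     # fraction of points used in each local fit

    def fit(self, x, y):
        # No calculations: the "model" is the training data itself.
        self.x_ = np.asarray(x, dtype=float)
        self.y_ = np.asarray(y, dtype=float)
        return self

    def predict(self, x0):
        x0 = np.atleast_1d(np.asarray(x0, dtype=float))
        k = max(3, int(np.ceil(self.frac * self.x_.size)))
        out = np.empty(x0.size)
        for i, xi in enumerate(x0):
            dist = np.abs(self.x_ - xi)
            h = np.sort(dist)[k - 1]                             # distance to k-th neighbour
            w = np.clip(1.0 - (dist / h) ** 3, 0.0, None) ** 3   # tricube weights
            sw = np.sqrt(w)
            A = np.column_stack([np.ones_like(self.x_), self.x_ - xi])
            coef = np.linalg.lstsq(A * sw[:, None], self.y_ * sw, rcond=None)[0]
            out[i] = coef[0]                                     # local line evaluated at xi
        return out

x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0                 # noise-free line, so the result is easy to check
model = Loess(frac=0.3).fit(x, y)
y0 = model.predict([2.5, 7.1])
print(y0)
```

All the work happens in predict, once per query point, and every call needs the full stored x and y.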
Check out the loess class in scikit-misc. The fitted object has a predict method:
loess_fit = loess(x, y, span=.01)
loess_fit.fit()
preds = loess_fit.predict(x_new).values

Extrapolating data from a curve using Python

I am trying to extrapolate future data points from a data set that contains one continuous value per day for almost 600 days. I am currently fitting a 1st order function to the data using numpy.polyfit and numpy.poly1d. In the graph below you can see the curve (blue) and the 1st order function (green). The x-axis is days since beginning. I am looking for an effective way to model this curve in Python in order to extrapolate future data points as accurately as possible. A linear regression isn't accurate enough and I'm unaware of any methods of nonlinear regression that can work in this instance.
This solution isn't accurate enough when I feed it future days:
x = dfnew["days_since"]
y = dfnew["nonbrand"]
z = numpy.polyfit(x,y,1)
f = numpy.poly1d(z)
x_new = future_days
y_new = f(x_new)
plt.plot(x,y, '.', x_new, y_new, '-')
I have now tried the curve_fit using a logarithmic function as the curve and data behaviour seems to conform to:
from scipy.optimize import curve_fit

def func(x, a, b):
    return a*numpy.log(x)+b

x = dfnew["days_since"]
y = dfnew["nonbrand"]
popt, pcov = curve_fit(func, x, y)
plt.plot(future_days, func(future_days, *popt), '-')
However when I plot it, my Y-values are way off:
The very general rule of thumb is that if your fitting function is not fitting well enough to your actual data then either:
You are using the function wrong, e.g. you are using 1st order polynomials. If you are convinced that it is a polynomial, then try higher-order polynomials.
You are using the wrong function. It is always worth taking a look at:
your data curve, and
what you know about the process that is generating the data
to come up with some hypotheses about what sort of model might fit better.
Might your process be a logarithmic one, a saturating one, etc.? Try them!
Finally, if you are not getting a consistent long-term trend then you might be able to justify using cubic splines.
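One way to "try them" is to hold out the most recent days and compare how well each candidate extrapolates. The sketch below uses synthetic data in place of the series from the question, which is not shown. It relies on a*log(x) + b being a straight line in log(x), so numpy.polyfit can fit it without curve_fit. Day counts start at 1 here because log(0) is -inf; if days_since starts at 0 in the real data, shifting it by one is a possible workaround.

```python
import numpy as np

# Synthetic stand-in for the daily series (a noise-free logarithmic curve)
days = np.arange(1, 601, dtype=float)
values = 40.0 * np.log(days) + 15.0

train, test = slice(0, 500), slice(500, 600)   # hold out the most recent 100 days

# Candidate 1: straight line in the original x
lin = np.poly1d(np.polyfit(days[train], values[train], 1))
# Candidate 2: a*log(x) + b, fitted as a straight line in log(x)
a, b = np.polyfit(np.log(days[train]), values[train], 1)

def rmse(pred, actual):
    """Root mean square error between prediction and held-out values."""
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

rmse_lin = rmse(lin(days[test]), values[test])
rmse_log = rmse(a * np.log(days[test]) + b, values[test])
print(a, b, rmse_lin, rmse_log)
```

Whichever candidate has the lower error on the held-out days is the more defensible choice for extrapolation, although a good score on 100 days says little about much longer horizons.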

