How is the offset handled in XGBoost's binary:logistic objective function? - python

I am working on a mortality prediction problem (binary outcome) with the “base mortality probability” as my offset in XGBoost.
I have used the gbtree booster and the binary:logistic objective function. In my data I have multiple observations/records with the same X values but different offset values.
As per my understanding (please correct me if I am wrong), XGBoost under the binary:logistic setup fits a model of the form log(p/(1-p)) = offset + F(x), where F(x) is optimized (for a specific loss function) using splits on the X values.
Thus, when the X values are exactly the same, I should be able to get F(x) by taking the predicted output (with the outputmargin = TRUE option) and subtracting the offset. However, with this approach I get different values of F(x) for the same X. I believe the way the offset is handled internally in XGBoost differs from my understanding. Can anyone explain the method/mathematical formulation used to handle the offset?
I am specifically interested in extracting the value of F(x) (as this is the additional information the model provides) by adjusting the model prediction for the offset values.
Here is the sample code:
library(xgboost)
x1 = runif(1000)
y1 = as.numeric(runif(1000)>.8)
y2 = as.numeric(runif(1000)>.8)
off1 = runif(1000)
off2 = runif(1000)
#stacking the data to have same X values
x= c(x1,x1)
y = c(y1,y2)
off = c(off1,off2)
length(unique(off)) # shows unique 2000 values
length(unique(x)) # shows unique 1000 values, i.e. each X is repeated once (as expected)
fulldata = cbind.data.frame(x,y,off)
train_dMtrix = xgb.DMatrix(data = as.matrix(x),
                           label = y,
                           base_margin = off)
params_list = list(booster = "gblinear", objective = "binary:logistic",
                   eta = 0.05, max_depth = 4, min_child_weight = 10, eval_metric = 'logloss')
set.seed(100)
xgbmodel = xgb.train(params = params_list, data = train_dMtrix, nrounds=100, callbacks = list(cb.gblinear.history()))
# Getting the prediction in link format
fulldata$Predicted_link = predict(xgbmodel, train_dMtrix, outputmargin = TRUE)
# Assuming Predicted_link = offset + F(x), calculating F(x) for each values of X
fulldata$F_x = fulldata$Predicted_link - fulldata$off
# As per my understanding, since F(x) is purely independent of the offset,
# the model predictions of F_x (not the predicted probability) should be exactly the same for the same values of x,
# irrespective of the corresponding offsets. Given I have 1000 distinct X values, I'm expecting 1000 distinct F_x values
length(unique(fulldata$F_x)) # shows almost 2000 unique values, which is contrary to my expectation.
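One thing that may be worth ruling out before concluding that the offset is handled differently is floating-point rounding. The NumPy sketch below does not call XGBoost. It only illustrates the assumed formulation margin = offset + F(x), using a hypothetical stand-in function f for the fitted model and assuming (not verified here) that the margin is held in single precision. Under those assumptions, subtracting the offset recovers F(x) only up to rounding, so an exact unique-value count can be close to the number of rows even though F(x) depends on x alone.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(size=1000)
x = np.concatenate([x1, x1])        # every x value appears twice
off = rng.uniform(size=2000)        # each row has its own offset

def f(x):
    # Hypothetical stand-in for the fitted F(x); it depends on x only
    return 0.7 * x - 1.2

# Assumption: the margin (link-scale prediction) is stored as float32
margin = (off + f(x)).astype(np.float32)
prob = 1.0 / (1.0 + np.exp(-margin))   # probability scale, for reference

# Recover F(x) the way the question does: margin minus offset
f_recovered = margin.astype(np.float64) - off

n_exact = len(np.unique(f_recovered))                    # exact comparison
max_err = np.max(np.abs(f_recovered - f(x)))             # size of the rounding error
pairs_match = np.allclose(f_recovered[:1000], f_recovered[1000:], atol=1e-6)
```

If this is what is happening in the real model, comparing the recovered values with a tolerance (as in pairs_match) rather than with an exact unique count would be the more informative check.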

Related

Error when observing on uniform with sampled parameters in PyMC

I’m new to PyMC and am trying to model a situation where you are rolling marbles at a wall and trying to find the block. The data is only for the values where the marble hits the block.
I’m first sampling an x location and size and then calculating a point from those with a uniform, but I’m getting an error.
import pymc3 as pm
import theano.tensor as tt
basic_model = pm.Model()
with basic_model:
    # We are assuming independence of these.
    x = pm.Uniform("x", lower=1, upper=30)
    l = pm.Uniform("l", lower=1, upper=30)
    lower = pm.Deterministic('lower', x-0.5*l)
    upper = pm.Deterministic('upper', x+0.5*l)
    point_x = pm.Uniform('point_x', lower=lower, upper=upper, observed=x_vals)
    pm.sample()
With the error:
SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'x_interval__': array(0.), 'l_interval__': array(0.)}
Initial evaluation results:
x_interval__ -1.39
l_interval__ -1.39
point_x -inf
Name: Log-probability of test_point, dtype: float64
Clearly the issue is with point_x. I’m guessing the error has to do with the fact that the observed data may potentially fall outside the lower-upper range depending on the value of x and l sampled. But how might I fix this?
The sampler doesn't know how to handle starting off in an invalid region of the parameter space. A quick and dirty fix is to provide testval arguments that ensure the sampling begins in a logically valid solution. For example, we know the minimum block must have:
l_0 = np.max(x_vals) - np.min(x_vals)
x_0 = np.min(x_vals) + 0.5*l_0
and could use those:
x = pm.Uniform("x", lower=1, upper=30, testval=x_0)
l = pm.Uniform("l", lower=1, upper=30, testval=l_0)
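To see why the default starting point fails while the data-driven one works, the log-likelihood of the observed points can be evaluated by hand. This is a minimal NumPy sketch, not PyMC code: x_vals is hypothetical (the original data are not shown), uniform_logp is an illustrative helper, and the default start is assumed to be the midpoint of each Uniform(1, 30) prior.

```python
import numpy as np

# Hypothetical observations; the original x_vals is not shown in the question
x_vals = np.array([3.25, 4.0, 5.5, 6.0, 7.25])

def uniform_logp(obs, lower, upper):
    """Summed Uniform(lower, upper) log-density; -inf if any point lies outside."""
    if upper <= lower or np.any(obs < lower) or np.any(obs > upper):
        return -np.inf
    return -obs.size * np.log(upper - lower)

# Assumed default start: midpoint of Uniform(1, 30) for both x and l
x_mid = l_mid = 15.5
logp_default = uniform_logp(x_vals, x_mid - 0.5 * l_mid, x_mid + 0.5 * l_mid)

# Data-driven start: the smallest block that still covers every observation
l_0 = np.max(x_vals) - np.min(x_vals)
x_0 = np.min(x_vals) + 0.5 * l_0
logp_start = uniform_logp(x_vals, x_0 - 0.5 * l_0, x_0 + 0.5 * l_0)
```

With these values the default start puts the block at [7.75, 23.25], which excludes most of the data, so the log-likelihood is -inf. The data-driven start places the extreme observations exactly on the block edges, so in practice a slightly larger l_0 may be safer against rounding.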
Also, the nature of this model leads to many rejections due to impossibility, so you may want to use Metropolis for sampling, which almost always needs more steps and tuning:
pm.sample(tune=10000, draws=10000, step=pm.Metropolis())
Alternative Models
Otherwise, consider reparameterizing so that only valid solutions are in the parameter space. One approach would be to sample l and then use that to constrain x. Something like:
other_model = pm.Model()
x_min = np.min(x_vals)
x_max = np.max(x_vals)
l_0 = x_max - x_min
with other_model:
    # these have logical constraints from the data
    l = pm.Uniform("l", lower=l_0, upper=30)
    x = pm.Uniform("x", lower=x_max - 0.5*l, upper=x_min + 0.5*l)
    lower = pm.Deterministic('lower', x - 0.5*l)
    upper = pm.Deterministic('upper', x + 0.5*l)
    point_x = pm.Uniform('point_x', lower=lower, upper=upper, observed=x_vals)
    res = pm.sample(step=pm.NUTS(), return_inferencedata=True)
Another approach would be to sample lower and upper directly, and compute the x and l as deterministic variables from those.
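The lower/upper parameterization can be illustrated without PyMC. The sketch below is only a rough NumPy stand-in under stated assumptions: x_vals is hypothetical, the priors on the two edges (uniform on the side of the data where they remain valid, within 0 to 30) are my own choice and differ from the original priors on x and l, and the posterior is approximated by self-normalized importance weighting of prior draws rather than MCMC.

```python
import numpy as np

rng = np.random.default_rng(1)
x_vals = np.array([3.25, 4.0, 5.5, 6.0, 7.25])   # hypothetical data
x_min, x_max = x_vals.min(), x_vals.max()

# Assumed priors: each edge is drawn only where it cannot exclude any data point
n_draws = 20000
lower = rng.uniform(0.0, x_min, size=n_draws)
upper = rng.uniform(x_max, 30.0, size=n_draws)

# Likelihood of n points under Uniform(lower, upper) is (upper - lower) ** -n
log_w = -x_vals.size * np.log(upper - lower)
w = np.exp(log_w - log_w.max())
w /= w.sum()

# x and l are deterministic functions of the two edges
x = 0.5 * (lower + upper)
l = upper - lower
x_post = np.sum(w * x)   # weighted posterior mean of the block centre
l_post = np.sum(w * l)   # weighted posterior mean of the block length
```

Because every draw is valid by construction, no starting-point problem can arise in this parameterization.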

Is it possible to shorten individual columns in pandas dataframes?

I am working with a 1000x40 data frame where I am fitting each column with a function.
For this, I am normalizing the data to run from 0 to 1 and then I fit each column by this sigmoidal function,
def func_2_2(x, slope, halftime):
    yfit = 0 + 1 / (1+np.exp(-slope*(x-halftime)))
    return yfit
# initial guesses for function
slope_guess = 0.5
halftime_guess = 100
# Construct initial guess array
p0 = np.array([slope_guess, halftime_guess])
# set up curve fit
col_params = {}
for col in dfnormalized.columns:
    x = df.iloc[:,0]
    y = dfnormalized[col].values
    # curve_fit returns (optimal parameters, covariance); keep the parameters
    popt = curve_fit(func_2_2, x, y, p0=p0, maxfev=10000)
    col_params[col] = popt[0]
This code is working well for me, but the fit would make more physical sense if I could shorten each column individually. For some columns the data already plateau at virtually 1 after e.g. 500 data points, for others after 700. I would like to implement a function that simply cuts off the column once it has reached 1, so that the remaining 300 or more data points are not included in the fit. My idea is to work backwards from the end in blocks of 50 data points, discarding a block if its average is close to 1, until I arrive at the data that I want to be included.
When I try to add a function where I try to determine the average of the last 50 datapoints with e.g. passing the y-vector from above like this:
def cutdata(y):
    lastfifty = y.tail(50).average
I receive the error message
AttributeError: 'numpy.ndarray' object has no attribute 'tail'
Does my approach make sense and is it possible within the data frame?
- Thanks in advance, any help is greatly appreciated.
print(y)
gives
[0.00203105 0.00407113 0.00145333 ... 0.99178177 0.97615621 0.97236191]
This has to do with the use of pd.Series.values, which gives you an np.ndarray instead of a pd.Series.
A conservative change to your code would move the use of .values into the curve_fit call. It may not even be necessary there, since curve_fit converts ydata to a NumPy array internally.
for col in dfnormalized.columns:
    x = df.iloc[:,0]
    y = dfnormalized[col] # No more .values here.
    popt = curve_fit(func_2_2, x, y.values, p0=p0, maxfev=10000)
    col_params[col] = popt[0]
The essential part is highlighted by the comment, which is that your y variable will remain a pd.Series. Then you can get the average of the last observations.
y.tail(50).mean()
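Building on that, the block-wise trimming described in the question could look roughly like the sketch below. The function name trim_plateau, the tolerance of 0.02, and the synthetic sigmoid column are all assumptions made for illustration; the right tolerance depends on the noise in the real data.

```python
import numpy as np
import pandas as pd

def trim_plateau(y, window=50, tol=0.02):
    """Drop trailing blocks of `window` points while their mean is within `tol` of 1."""
    end = len(y)
    # Always keep at least one block so there is something left to fit
    while end > window and abs(y.iloc[:end].tail(window).mean() - 1.0) < tol:
        end -= window
    return y.iloc[:end]

# Synthetic normalized column: a sigmoid that plateaus well before the end
x = np.arange(1000)
y = pd.Series(1.0 / (1.0 + np.exp(-0.05 * (x - 200))))

y_short = trim_plateau(y)
x_short = x[:len(y_short)]   # matching x values for the fit
```

The shortened pair x_short, y_short.values can then be passed to curve_fit in place of the full column.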

Python stats module: How to extract confidence/prediction intervals from GPy?

After having looked through all the docs and examples online, I have not been able to find a way to extract information regarding the confidence or prediction intervals from GPy models.
I generate dummy data like this,
## Generating data for regression
# First, regular sine wave + normal noise
x = np.linspace(0,40, num=300)
noise1 = np.random.normal(0,0.3,300)
y = np.sin(x) + noise1
## Second, an upward trending starting midway, with its own noise as well
temp = x[150:]
noise2 = 0.004*temp**2 + np.random.normal(0,0.1,150)
y[150:] = y[150:] + noise2
plt.plot(x, y)
and then estimate a basic model,
## Pre-processing
X = np.expand_dims(x, axis=1)
Y = np.expand_dims(y, axis=1)
## Model
kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)
model1 = GPy.models.GPRegression(X, Y, kernel)
However, nothing makes it clear how to proceed from there... Another question here tried asking the same thing, but that answer does not work any more, and seems rather unsatisfactory, for such an important element of statistical modelling.
Given a model, and a set of target x values we want to generate the intervals at, you can extract the intervals using:
intervals = model.predict_quantiles( X = target_x_vals, quantiles = (2.5, 97.5) )
You can change the quantiles argument to get intervals of the desired width. The documentation for this function is found at: https://gpy.readthedocs.io/en/deploy/_modules/GPy/core/gp.html
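For intuition about what such an interval means: when the predictive distribution at each point is Gaussian, a quantile is the predictive mean plus a standard-normal z-value times the predictive standard deviation. The sketch below shows only that arithmetic and does not call GPy. The mean and variance arrays are hypothetical stand-ins for what a model's predict method would return, and gaussian_quantiles is an illustrative helper, not a GPy function.

```python
import numpy as np
from statistics import NormalDist

def gaussian_quantiles(mean, var, quantiles=(2.5, 97.5)):
    """Return one array per requested percentile for N(mean, var) predictions."""
    sd = np.sqrt(var)
    # inv_cdf gives the standard-normal z-value for each percentile
    return [mean + NormalDist().inv_cdf(q / 100.0) * sd for q in quantiles]

mean = np.array([0.0, 1.0, 2.0])   # hypothetical predictive means
var = np.array([1.0, 0.25, 4.0])   # hypothetical predictive variances

lo, hi = gaussian_quantiles(mean, var)
```

Whether the variance includes the observation noise decides if the result is a confidence interval for the latent function or a prediction interval for new observations.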

Fixing fit parameters in curve_fit

I have a function Imaginary which describes a physics process and I want to fit this to a dataset x_interpolate, y_interpolate. The function is a form of a Lorentzian peak function and I have some initial values that are user given, except for f_peak (the peak location) which I find using a peak finding algorithm. All of the fit parameters, except for the offset, are expected to be positive and thus I have set bounds_I accordingly.
def Imaginary(freq, alpha, res, Ms, off):
    numerator = (2*alpha*freq*res**2)
    denominator = (4*(alpha*res*freq)**2) + (res**2 - freq**2)**2
    Im = Ms*(numerator/denominator) + off
    return Im
pI = np.array([alpha_init, f_peak, Ms_init, 0])
# One lower and one upper bound per parameter: alpha, res, Ms >= 0; off is unbounded
bounds_I = ([0, 0, 0, -np.inf], [np.inf, np.inf, np.inf, np.inf])
poptI, pcovI = curve_fit(Imaginary, x_interpolate, y_interpolate, pI, bounds=bounds_I)
In some situations I want to keep the parameter f_peak fixed during the fitting process. I tried an easy solution by changing bounds_I to:
bounds_I = ([0, f_peak-0.001, 0, -np.inf], [np.inf, f_peak+0.001, np.inf, np.inf])
This is for many reasons not an optimal way of doing this so I was wondering if there is a more Pythonic way of doing this? Thank you for your help
If a parameter is fixed, it is not really a parameter, so it should be removed from the list of parameters. Define a model that has that parameter replaced by a fixed value, and fit that. Example below, simplified for brevity and to be self-contained:
x = np.arange(10)
y = np.sqrt(x)
def parabola(x, a, b, c):
    return a*x**2 + b*x + c
fit1 = curve_fit(parabola, x, y) # [-0.02989396, 0.56204598, 0.25337086]
b_fixed = 0.5
fit2 = curve_fit(lambda x, a, c: parabola(x, a, b_fixed, c), x, y)
The second call to fit returns [-0.02350478, 0.35048631], which are the optimal values of a and c. The value of b was fixed at 0.5.
Of course, the parameter should be removed from the initial vector pI and the bounds as well.
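Applied to the Imaginary function from the question, the same idea might look like the sketch below. The data are synthetic and noise-free, and the "true" values (alpha 0.05, peak at 5.0, Ms 2.0, offset 0.1) and starting guesses are hypothetical, chosen only so the example is self-contained.

```python
import numpy as np
from scipy.optimize import curve_fit

def Imaginary(freq, alpha, res, Ms, off):
    numerator = 2 * alpha * freq * res**2
    denominator = 4 * (alpha * res * freq)**2 + (res**2 - freq**2)**2
    return Ms * (numerator / denominator) + off

# Synthetic data generated from hypothetical parameter values
freq = np.linspace(1.0, 10.0, 200)
f_peak = 5.0
y = Imaginary(freq, 0.05, f_peak, 2.0, 0.1)

# res is no longer a fit parameter, so p0 and the bounds have three entries
p0 = [0.08, 1.5, 0.0]
bounds = ([0, 0, -np.inf], [np.inf, np.inf, np.inf])

popt, pcov = curve_fit(
    lambda f, alpha, Ms, off: Imaginary(f, alpha, f_peak, Ms, off),
    freq, y, p0=p0, bounds=bounds,
)
alpha_fit, Ms_fit, off_fit = popt
```

Switching between a fixed and a free peak position then only means choosing between this wrapper and the original four-parameter call.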
You might find lmfit (https://lmfit.github.io/lmfit-py/) helpful. This library adds a higher-level interface to the scipy optimization routines, aiming for a more Pythonic approach to optimization and curve fitting. For example, it uses Parameter objects to allow setting bounds and fixing parameters without having to modify the objective or model function. For curve-fitting, it defines high level Model functions that can be used.
For your example, you could use your Imaginary function as you've written it with
from lmfit import Model
lmodel = Model(Imaginary)
and then create Parameters (lmfit will name the Parameter objects according to your function signature), providing initial values:
params = lmodel.make_params(alpha=alpha_init, res=f_peak, Ms=Ms_init, off=0)
By default all Parameters are unbounded and will vary in the fit, but you can modify these attributes (without rewriting the model function):
params['alpha'].min = 0
params['res'].min = 0
params['Ms'].min = 0
You can set one (or more) of the parameters to not vary in the fit as with:
params['res'].vary = False
To be clear: this does not require altering the model function, making it much easier to change what is fixed, what bounds are imposed, and so forth.
You would then perform the fit with the model and these parameters:
result = lmodel.fit(y_interpolate, params, freq=x_interpolate)
You can get a report of fit statistics, best-fit values and uncertainties for parameters with
print(result.fit_report())
The best fit Parameters will be held in result.params.
FWIW, lmfit also has builtin Models for many common forms, including Lorentzian and a Constant offset. So, you could construct this model as
from lmfit.models import LorentzianModel, ConstantModel
mymodel = LorentzianModel(prefix='l_') + ConstantModel()
params = mymodel.make_params()
which will have Parameters named l_amplitude, l_center, l_sigma, and c (where c is the constant) and the model will use the name x for the independent variable (your freq). This approach can become very convenient when you may want to change the functional form of the peaks or background, or when fitting multiple peaks to a spectrum.
I was able to solve this issue regarding arbitrary number of parameters and arbitrary positioning of the fixed parameters:
import numpy as np
from scipy.optimize import curve_fit

def d_fit(x, y, param, boundMi, boundMx, listparam):
    # Split the parameters into fitted (S...) and fixed (N...) groups
    Sparam, SboundMi, SboundMx = np.asarray([]), np.asarray([]), np.asarray([])
    Nparam, NboundMi, NboundMx = np.asarray([]), np.asarray([]), np.asarray([])
    for i in range(len(param)):
        if(listparam[i] == 1):
            Sparam = np.append(Sparam, np.asarray(param[i]))
            SboundMi = np.append(SboundMi, np.asarray(boundMi[i]))
            SboundMx = np.append(SboundMx, np.asarray(boundMx[i]))
        else:
            Nparam = np.append(Nparam, np.asarray(param[i]))
    def funF(x, Sparam):
        # Rebuild the full parameter list; j counts the fixed parameters seen so far
        j = 0
        for i in range(len(param)):
            if(listparam[i] == 1):
                param[i] = Sparam[i-j]
            else:
                param[i] = Nparam[j]
                j = j + 1
        return fun(x, param)
    return curve_fit(lambda x, *Sparam: funF(x, Sparam), x, y, p0 = Sparam, bounds = (SboundMi,SboundMx))
In this case:
param = [a,b,c,...] # parameters array (any size)
boundMi = [min_a, min_b, min_c,...] # minimum allowable value of each parameter
boundMx = [max_a, max_b, max_c,...] # maximum allowable value of each parameter
listparam = [0,1,1,0,...] # 1 = fit and 0 = fix the corresponding parameter in the fit routine
and the root function is defined as
def fun(x, param):
    a,b,c,d.... = param
    return a*b/c... # any function of the params a,b,c,d...
This way, you can change the root function and the number of parameters without changing the fit routine.
And, at any time, you can fix or free any parameter by changing "listparam".
Use it like this:
popt, pcov = d_fit(x, y, param, boundMi, boundMx, listparam)
"popt" is a 1D array whose length equals the number of 1s in "listparam" and holds the best-fit values; "pcov" is the corresponding covariance matrix.
"param" remains a 1D array of the same size as the original (input) "param". Note, however, that it is updated in place: the fitted entries are overwritten during the fit, while the fixed entries keep their values according to "listparam". Use "popt" for the exact best-fit values.
Hope this is useful!
Obs1: x = 1D array of independent values and y = 1D array of dependent values
Obs2: This is my first post. Please let me know if I can improve it!

How to code a hierarchical mixture model of multivariate normals using PYMC

I successfully implemented a mixture of 3 normals using PyMC (shown at https://drive.google.com/file/d/0Bwnmbh6ueWhqSkUtV1JFZDJwLWc, and similar to the question asked at How to model a mixture of 3 Normals in PyMC?)
My next step is to try and code mixtures of multivariate normals.
There is, however, an additional complexity to the data: a hierarchy, with sets of observations belonging to a parent observation. The clustering is done on the parent observations, and not on the individual observations themselves. This first step generates the data (60 parents, with 50 observations per parent) and works fine.
import numpy as np
import pymc as mc
n = 3 #mixtures
B = 5 #Bias between those at different mixtures
tau = 3 #Variances
nprov = 60 #number of parent observations
mu = [[0,0],[0,B],[-B,0]]
true_cov0 = np.array([[1.,0.],[0.,1.]])
true_cov1 = np.array([[1.,0.],[0.,tau**(2)]])
true_cov2 = np.array([[tau**(-2),0],[0.,1.]])
trueprobs = [.4, .3, .3] #probability of being in each of the three mixtures
prov = np.random.multinomial(1, trueprobs, size=nprov)
v = prov[:,1] + (prov[:,2])*2
numtoeach = 50
n_obs = nprov*numtoeach
vAll = np.tile(v,numtoeach)
ndata = numtoeach*nprov
p1 = range(nprov)
prov1 = np.tile(p1,numtoeach)
data = (vAll==0)*(np.random.multivariate_normal(mu[0],true_cov0,ndata)).T \
+ (vAll==1)*(np.random.multivariate_normal(mu[1],true_cov1,ndata)).T \
+ (vAll==2)*(np.random.multivariate_normal(mu[2],true_cov2,ndata)).T
data=data.T
However, when I try to use PyMC to do the sampling, I run into trouble ('error: failed in converting 3rd argument `tau' of flib.prec_mvnorm to C/Fortran array')
p = 2 #covariates
prior_mu1=np.ones(p)
prior_mu2=np.ones(p)
prior_mu3=np.ones(p)
post_mu1 = mc.Normal("returns1",prior_mu1,1,size=p)
post_mu2 = mc.Normal("returns2",prior_mu2,1,size=p)
post_mu3 = mc.Normal("returns3",prior_mu3,1,size=p)
post_cov_matrix_inv1 = mc.Wishart("cov_matrix_inv1",n_obs,np.eye(p) )
post_cov_matrix_inv2 = mc.Wishart("cov_matrix_inv2",n_obs,np.eye(p) )
post_cov_matrix_inv3 = mc.Wishart("cov_matrix_inv3",n_obs,np.eye(p) )
#Combine prior means and variance matrices
meansAll= np.array([post_mu1,post_mu2,post_mu3])
precsAll= np.array([post_cov_matrix_inv1,post_cov_matrix_inv2,post_cov_matrix_inv3])
dd = mc.Dirichlet('dd', theta=(1,)*n)
category = mc.Categorical('category', p=dd, size=nprov)
#This step accounts for the hierarchy: observations' means are equal to their parents mean
#Parent is labeled prov1
@mc.deterministic
def mean(category=category, meansAll=meansAll):
    lat = category[prov1]
    new = meansAll[lat]
    return new
@mc.deterministic
def prec(category=category, precsAll=precsAll):
    lat = category[prov1]
    return precsAll[lat]
obs = mc.MvNormal( "observed returns", mean, prec, observed = True, value = data)
I know the problem is not with the format of the simulated observed data, because this step would work fine, in place of the above:
obs = mc.MvNormal( "observed returns", post_mu3, post_cov_matrix_inv3, observed = True, value = data )
As a result, I think the issue is how the mean vector ('mean') and the covariance matrix ('prec') are entered, I just don't know how. Like I said, this worked fine with mixtures of normal distributions, but mixtures of multivariate normals is adding a complexity I can't figure out.
This is a good example of the difficulty PyMC has with vectors of multivariate variables. Not that it's difficult, just not as elegant as it should be. You should create a list comprehension of the MVN nodes and wrap that as an observed stochastic.
@mc.observed
def obs(value=data, mean=mean, prec=prec):
    return sum(mc.mv_normal_like(v, m, T) for v,m,T in zip(data, mean, prec))
Here is the IPython notebook
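To make the summed log-likelihood concrete, here is a minimal NumPy sketch of the same computation outside PyMC: one multivariate normal log-density per row, each with its own mean and precision matrix, added up. The helper mv_normal_logp, the three data rows, and the component labels are hypothetical and only mirror the structure of the model above.

```python
import numpy as np

def mv_normal_logp(x, mu, T):
    """Log-density of N(mu, inverse(T)) at x, where T is the precision matrix."""
    k = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(T)
    return 0.5 * logdet - 0.5 * k * np.log(2 * np.pi) - 0.5 * diff @ T @ diff

# Hypothetical observations, component means and precisions (3 components, 2 dims)
data = np.array([[0.0, 0.0], [1.0, 5.0], [-5.0, 0.5]])
means = np.array([[0.0, 0.0], [0.0, 5.0], [-5.0, 0.0]])
precs = np.array([np.eye(2), np.diag([1.0, 1 / 9.0]), np.diag([9.0, 1.0])])
labels = np.array([0, 1, 2])   # assumed component of each row's parent

# Index the per-component parameters by label, then sum row-wise log-densities
total = sum(mv_normal_logp(v, m, T)
            for v, m, T in zip(data, means[labels], precs[labels]))
```

Indexing means and precs by the label array is the NumPy counterpart of the deterministic mean and prec nodes, which expand the per-parent categories to one entry per observation.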
