wls_prediction_std returns the standard deviation and confidence interval of my fitted model data. I need to know how the confidence intervals are calculated from the covariance matrix. (I already tried to figure it out by looking at the source code but wasn't able to.) I was hoping someone could help me out by writing out the mathematical expression behind wls_prediction_std.
There should be a variation on this in any textbook, without the weights.
For OLS, Greene (5th edition, which I used) has
se^2 = s^2 (1 + x (X'X)^{-1} x')
where s^2 is the estimate of the residual variance, x is the vector of explanatory variables for which we want to predict, and X are the explanatory variables used in the estimation.
This is the squared standard error for a new observation; the second part alone, s^2 x (X'X)^{-1} x', is the squared standard error for the predicted mean y_predicted = x beta_estimated.
wls_prediction_std uses the variance of the parameter estimate directly.
Assuming x is fixed, then y_predicted is just a linear transformation of the random variable beta_estimated, so the variance of y_predicted is just
x Cov(beta_estimated) x'
To this we still need to add the estimate of the error variance.
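For concreteness, here is a minimal sketch (plain OLS on made-up data, no weights) of the quantities described above, checked against wls_prediction_std from the statsmodels sandbox:
import numpy as np
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=100)
res = sm.OLS(y, X).fit()

# variance of the predicted mean: x Cov(beta_estimated) x' for each row x
var_mean = np.einsum('ij,jk,ik->i', X, res.cov_params(), X)
# prediction variance adds the residual variance estimate s^2
var_obs = var_mean + res.scale

prstd, iv_l, iv_u = wls_prediction_std(res)
print(np.allclose(prstd, np.sqrt(var_obs)))  # True: same standard error of prediction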
As far as I remember, there are estimates that have better small sample properties.
I added the weights, but never managed to verify them, so the function has remained in the sandbox for years. (Stata doesn't return prediction errors with weights.)
Aside:
Using the covariance of the parameter estimate should also be correct if we use a sandwich robust covariance estimator, while Greene's formula above is only correct if we don't have any misspecified heteroscedasticity.
What wls_prediction_std doesn't take into account is that, if we have a model for the heteroscedasticity, then the error variance could also depend on the explanatory variables, i.e. on x.
I'm currently doing a project investigating the Bayesian Lasso, and part of the project involves running some simulations. It can be shown that if we place independent and identically distributed conditional Laplace priors on the regression coefficients beta, the posterior mode is a frequentist Lasso estimate with tuning parameter 2 x sigma x lambda. So to check my work, I often use both scikit-learn and statsmodels (in particular their Lasso implementations) to compute the frequentist Lasso estimate that should be approximately equal to the posterior mode (using the medians of sigma and lambda as estimates for 2 x sig x lam) and superimpose it onto my histograms. In the simulations I've run involving independent predictor variables, all Lasso estimates computed with scikit-learn and statsmodels agree, and they appear to coincide with the posterior mode when I superimpose them on my histograms.
However, if I use virtually the same code but in the case of (a) multicollinearity, (b) p > n, or (c) n > p but n small, e.g. n = 12 and p = 9, sklearn and statsmodels occasionally output different Lasso estimates for the same tuning hyperparameter, which is confusing. Here's my code for the multicollinear-predictors case (lam = 1 here):
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import Lasso

# X_trainStd, y_train, lam and burned_sig2_tracesLam are defined earlier in my script
sigmamedian = np.median(np.sqrt(burned_sig2_tracesLam))

# Check with scikit-learn's Lasso
skmodel = Lasso(alpha=2*sigmamedian*lam/(2*len(y_train)), fit_intercept=False, tol=1e-12)
skmodel = skmodel.fit(X_trainStd, y_train - np.mean(y_train))
print(skmodel.coef_)

# statsmodels' Lasso
smLasso = sm.OLS(y_train - np.mean(y_train), X_trainStd).fit_regularized(
    alpha=2*sigmamedian*lam/(2*len(y_train)))
print(smLasso.params)
The output in this case is:
[ 0.28284396, -1.23878332, 1.08344865, 0.29263474, 0. , 0.00655085]
[ 0.45950192, -1.32361768, 0.8906759, 0.28951489, 0. , 0.]
These aren't the same, which is confusing, because I've checked the documentation and I haven't made the common mistake of handling the intercept inconsistently between scikit-learn and statsmodels, the tuning hyperparameters I pass in are the same, both modules use coordinate descent, and LASSO estimates should be unique for n > p (and in fact Tibshirani has shown that LASSO estimates are almost surely unique if the data are generated from a continuous distribution, even when p > n). After superimposing both of these onto the histograms of the posterior distributions, it appears that it is scikit-learn's implementation that returns the posterior mode (or at least a very good approximation of it). If I change lambda to 2, scikit-learn and statsmodels still return different estimates, but now it is statsmodels that accurately approximates the posterior mode.
I also generated data with p = 25 and n = 20; the theory says the Lasso should set at least 5 of the coefficients exactly to zero, but neither scikit-learn nor statsmodels did this.
What's going on here?
I am using regression to analyze server data to find feature importance.
Some of my IVs (independent variables) or Xs are in percentages like % of time, % of cores, % of resource used, while others are in numbers like number of bytes, etc.
I standardized all my Xs with (X-X_mean)/X_stddev. (Am I wrong in doing so?)
Which algorithm should I use in Python in case my IVs are a mix of numeric and %s and I predict Y in the following cases:
Case 1: Predict a continuous valued Y
a. Will using a Lasso regression suffice?
b. How do I interpret the X-coefficient if X is standardized and is a numeric value?
c. How do I interpret the X-coefficient if X is standardized and is a %?
Case 2: Predict a %-ed valued Y, like "% resource used".
a. Should I use Beta-Regression? If so, which package in Python offers this?
b. How do I interpret the X-coefficient if X is standardized and is a numeric value?
c. How do I interpret the X-coefficient if X is standardized and is a %?
If I am wrong in standardizing the Xs that are already percentages, is it fine to use these numbers as 0.30 for 30% so that they fall within the range 0-1? That means I would not standardize them, but would still standardize the other numeric IVs.
Final Aim for both Cases 1 and 2:
To find the % of impact of IVs on Y.
e.g.: When X1 increases by 1 unit, Y increases by 21%
I understand from other posts that we can NEVER add up all coefficients to a total of 100 to assess the % of impact of each and every IV on the DV. I hope I am correct in this regard.
Having a mix of predictors doesn't matter for any form of regression; it only changes how you interpret the coefficients. What does matter, however, is the type/distribution of your Y variable.
Case 1: Predict a continuous valued Y
a. Will using a Lasso regression suffice?
Regular OLS regression will work fine for this
b. How do I interpret the X-coefficient if X is standardized and is a numeric value?
The interpretation of coefficients always follows a format like "for a 1 unit change in X, we expect an x-coefficient amount of change in Y, holding the other predictors constant"
Because you have standardized X, your unit is a standard deviation. So the interpretation will be "for a 1 standard deviation change in X, we expect an X-coefficient amount of change in Y..."
c. How do I interpret the X-coefficient if X is standardized and is a %?
Same as above. Your units are still standard deviations, even though the variable originally came from a percentage.
Case 2: Predict a %-ed valued Y, like % resource used.
a. Should I use Beta-Regression? If so, which package in Python offers this?
This is tricky. The typical recommendation is to use something like binomial logistic regression when your Y outcome is a percentage.
b. How do I interpret the X-coefficient if X is standardized and is a numeric value?
c. How do I interpret the X-coefficient if X is standardized and is a %?
Same as the interpretations above, but if you use logistic regression, the coefficients are in units of log odds. I would recommend reading up on logistic regression to get a deeper sense of how this works.
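As a rough, hypothetical sketch of that binomial / "fractional logit" idea (made-up data, not from the question), using statsmodels' GLM with a Binomial family on a Y that lives in [0, 1]:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(200, 2)))
eta = X @ np.array([-0.5, 1.0, 0.3])
y_frac = 1 / (1 + np.exp(-eta)) + rng.normal(scale=0.05, size=200)
y_frac = np.clip(y_frac, 0.001, 0.999)  # keep the fractional outcome strictly inside (0, 1)

# "fractional logit": a GLM with a Binomial family fit to a proportion outcome;
# coefficients are on the log-odds scale of Y
frac_logit = sm.GLM(y_frac, X, family=sm.families.Binomial()).fit()
print(frac_logit.summary())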
If I am wrong in standardizing the Xs that are already percentages, is it fine to use these numbers as 0.30 for 30% so that they fall within the range 0-1? That means I would not standardize them, but would still standardize the other numeric IVs.
Standardizing is perfectly fine for variables in regression, but like I said, it changes your interpretation as your unit is now a standard deviation
Final Aim for both cases 1 & 2: To find the % of impact of IVs on Y. E.g.: when X1 increases by 1 unit, Y increases by 21%.
If your Y is a percentage and you use something like OLS regression, then that is exactly how you would interpret the coefficients (for a 1 unit change in X1, Y changes by some percent)
Your question confuses some concepts and jumbles a lot of terminology. Essentially you're asking about a) feature preprocessing for (linear) regression, b) the interpretability of linear regression coefficients, and c) sensitivity analysis (the effect of feature X_i on Y). But be careful because you're making a huge assumption that Y is linearly dependent on each X_i, see below.
Standardization is not an "algorithm", just a technique for preprocessing data.
Standardization is commonly used for regression (and matters for penalized methods like Lasso, so that features are on comparable scales), but it is not needed for tree-based algorithms (RF/XGB/GBT) - with those, you can feed in raw numeric features directly (percents, totals, whatever).
(X-X_mean)/X_stddev is standardization (z-scoring): each variable ends up with mean 0 and standard deviation 1.
An alternative is min-max normalization, (X-X_min)/(X_max-X_min), which rescales each variable into the range [0,1].
Last, you ask about sensitivity analysis in regression: can we directly interpret the regression coefficient for X_i as the sensitivity of Y to X_i?
Stop and think about your underlying linearity assumption in "Final Aim for both cases 1 & 2: To find the % of impact of IVs on Y. Eg: When X1 increases by 1 unit, Y increases by 21%".
You're assuming that the dependent variable has a linear relationship with each independent variable. But that is often not the case; the relationship may be nonlinear. For example, if you look at the effect of Age on Salary, you would typically see salary increase up to the 40s/50s, then decrease gradually, and then drop sharply around retirement age (say 65).
So you would model the effect of Age on Salary as quadratic or a higher-order polynomial, by adding Age^2 and maybe Age^3 terms (or sometimes sqrt(X), log(X), log1p(X), exp(X), etc. - whatever best captures the nonlinear relationship). You may also see variable-variable interaction terms, although strongly correlated predictors (multicollinearity) make the individual coefficients hard to interpret.
Obviously Age has a huge effect on Salary, but we would not measure the sensitivity of Salary to Age by combining the (absolute values of the) coefficients of Age, Age^2 and Age^3.
If we only had a linear term for Age, the single coefficient for Age would massively understate the influence of Age on Salary: it would effectively average out the strong positive relationship for Age < 40 against the negative relationship for Age > 50.
So the general answer to "Can we directly interpret the regression coefficient for X_i as the sensitivity of Y on X_i?" is "Only if the relationship between Y and that X_i is linear, otherwise no".
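As a hypothetical illustration of that point (made-up Age/Salary data, not from the question), compare a linear-only fit with one that adds an Age^2 term:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
age = rng.uniform(20, 70, size=300)
salary = 30 + 2.5 * age - 0.03 * age**2 + rng.normal(scale=5, size=300)  # rise-then-fall shape

X_linear = sm.add_constant(age)
X_quad = sm.add_constant(np.column_stack([age, age**2]))

print(sm.OLS(salary, X_linear).fit().rsquared)  # the single Age coefficient averages out the curvature
print(sm.OLS(salary, X_quad).fit().rsquared)    # the quadratic term captures it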
In general, a better and easier way to do sensitivity analysis (without assuming linear response, or needing standardization of % features) is tree-based algorithms (RF/XGB/GBT) which generate feature importances.
As an aside, I understand your exercise tells you to use regression, but in general you get better and faster feature-importance information from tree-based methods (RF/XGB), especially with a shallow tree (small max_depth, large leaf size, e.g. >0.1% of the training-set size). That's why people use them, even when their final goal is regression.
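For example, a minimal sketch (made-up data, not from the original question) of feature importances from a shallow random forest in scikit-learn:
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)  # only the first two features matter

rf = RandomForestRegressor(n_estimators=200, max_depth=3, min_samples_leaf=5, random_state=0)
rf.fit(X, y)
print(rf.feature_importances_)  # importance should concentrate on the informative features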
(Your question would get better answers over at Cross Validated, but it's fine to leave it here on SO; there is some crossover.)
As part of a project I am having trouble with the gradients of a normal distribution in tensorflow_probability. For this I create a normal distribution from which a sample is drawn. The log_prob of this sample shall then be fed into an optimizer to update the weights of the network.
If I take the log_prob of some constant I always get non-zero gradients. Unfortunately I have not found any relevant help in tutorials or similar sources.
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def get_log_prob(mu, std):
    normal = tfd.Normal(loc=mu, scale=std)
    samples = normal.sample(sample_shape=(1,))
    log_prob = normal.log_prob(samples)
    return log_prob

const = tf.constant([0.1], dtype=np.float32)
log_prob = get_log_prob(const, 0.01)
grads = tf.gradients(log_prob, const)

with tf.Session() as sess:
    gradients = sess.run([grads])
    print('gradients', gradients)
Output: gradients [array([0.], dtype=float32)]
I expect to get non-zero gradients when computing the gradient of a sample. Instead the output is always "0.".
This is a consequence of TensorFlow Probability implementing reparameterization gradients (aka the "reparameterization trick"), and it is in fact the correct answer in certain situations. Let me show you how that 0. answer comes about.
One way to generate a sample from a normal distribution with some location and scale is to first generate a sample from a standard normal distribution (this is usually some library provided function, e.g. tf.random.normal in TensorFlow) and then shift and scale it. E.g. let's say the output of tf.random.normal is z. To get a sample x from the normal distribution with location loc and scale scale, you'd do: x = z * scale + loc.
Now, how does one compute the value of the probability density of a number under the normal distribution? One way to do it is to reverse that transformation, so that you're now dealing with a standard normal distribution, and then compute the log-probability density there. I.e. log_prob(x) = log_prob_std_normal((x - loc) / scale) + f(scale) (the f(scale) term comes from the change of variables involved in the transformation; its form doesn't matter for this explanation).
You can now plug the first expression into the second, and you get log_prob(x) = log_prob_std_normal(z) + f(scale), i.e. loc cancels entirely! As a result, the gradient of log_prob with respect to loc is 0. This also explains why you don't get a 0. if you evaluate the log probability at a constant: the constant is missing the forward transformation used to create the sample, so you get some (typically) non-zero gradient.
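To illustrate (a minimal sketch in the same TF1 style as the question, not part of the original answer): the gradient with respect to loc is 0. for a reparameterized sample, but non-zero for an unrelated constant.
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

loc = tf.constant([0.1], dtype=np.float32)
dist = tfd.Normal(loc=loc, scale=0.01)

sample = dist.sample(1)  # sample = loc + 0.01 * z, so loc cancels inside log_prob
grad_sample = tf.gradients(dist.log_prob(sample), loc)

fixed_point = tf.constant([0.12], dtype=np.float32)  # arbitrary constant; no forward transformation here
grad_fixed = tf.gradients(dist.log_prob(fixed_point), loc)

with tf.Session() as sess:
    print(sess.run([grad_sample, grad_fixed]))  # first is [0.], second is non-zero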
So, when is this the correct behavior? Reparameterization gradients are correct when you're computing gradients, with respect to the distribution parameters, of an expectation of a function under that distribution. One way to compute such an expectation is a Monte Carlo approximation, like so: tf.reduce_mean(g(dist.sample(N)), axis=0). It sounds like that's what you're doing (where your g() is log_prob()), so it looks like the gradients are correct.
I have built an XGBoost regressor model using around 200 categorical features to predict a continuous time variable.
But I would also like to get the probability of that prediction along with the actual prediction as output. Is there any way to get this from the XGBoost regressor model?
So I want both the prediction and P(Y|X) as output. Any idea how to do this?
There is no probability in regression. In regression the only output you get is a predicted value (that's why it is called regression), so for a regressor the probability of a prediction is not available; it only exists in classification.
As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note though, is that the variance might not be the same along the data.
Let's assume that you study a time-based phenomenon. Specifically, you have the temperature (y) inside an oven after a time (x) (in seconds, for instance). At x = 0 s it is at 20°C, you start heating it, and you want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you took care of heteroscedasticity, so your interval is the same for all the data.
You can probably try to get the distribution of your known outputs, compare the prediction against that curve, and check the p-value. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
EDIT
This is how I would do it. Obviously the outputs are your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d

N = 1000  # the number of samples
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)

# We want a normed histogram (since this is a PDF, it must integrate to 1)
nbins = N // 10
n = N // nbins
p, x = np.histogram(outputs, bins=n, density=True)
plt.hist(outputs, bins=n, density=True)
x = x[:-1] + (x[1] - x[0]) / 2  # converting bin edges to centers

# Now we want to interpolate:
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9 * std, 2.9 * std, 10000)
plt.plot(x, f(x))
plt.show()

# To check: the area under the interpolated PDF should be close to 1
area = integrate.quad(f, x[0], x[-1])
print(area)
Now, the interpolation method is not great for outliers: if a predicted value is extremely far (more than about 3 standard deviations) from your distribution, it won't work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in the time I had. I'm sure there are better ways to do it. If your data follow a normal distribution, it becomes trivial.
I suggest you look into NGBoost (essentially a wrapper around XGBoost that ultimately provides a probabilistic model).
Here you can find slides on how NGBoost works, as well as the seminal NGBoost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (by default the Gaussian distribution) and fit an XGBoost model to estimate the best parameters of the distribution (for the Gaussian, $\mu$ and $\sigma$). The model will split the variables' space into different regions with different distributions, i.e. the same family (e.g. Gaussian) but different parameters.
After training the model, you're provided with the method pred_dist, which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$.
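A rough sketch of what that looks like (made-up data; the calls below follow NGBoost's NGBRegressor API as I understand it, so double-check against the current docs):
import numpy as np
from ngboost import NGBRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=300)

ngb = NGBRegressor().fit(X, y)   # fits a Gaussian P(Y | X = x) by default
print(ngb.predict(X[:5]))        # point predictions (the distribution mean)
dist = ngb.pred_dist(X[:5])      # full predictive distributions
print(dist.params['loc'], dist.params['scale'])  # per-row mu and sigma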
The output error is:
MinimizerException:
Cannot determine Confidence Intervals without sensible uncertainty estimates
Why do I get this error? How can I calculate uncertainty estimates and solve this problem? Here is my code:
import glob
import numpy as np
from numpy import sqrt
import lmfit
from lmfit.models import VoigtModel
from scipy.special import wofz

for dosya1 in glob.glob("mean*"):
    data1 = np.genfromtxt(dosya1, skip_header=0, skip_footer=0,
                          names=["wavelength", "mean"])
    x = data1["wavelength"]
    y = data1["mean"]

    mod = VoigtModel()
    pars = mod.guess(y, x=x)
    pars['gamma'].set(value=0.7, vary=True, expr="")
    out = mod.fit(y, pars, x=x)

    pars = lmfit.Parameters()
    pars.add_many(('amp', out.params["amplitude"].value),
                  ('sig', out.params["sigma"].value),
                  ("gam", out.params["gamma"].value),
                  ("cent", out.params["center"].value))

    def residual(p):
        amp = p["amp"].value
        sig = p["sig"].value
        gam = p["gam"].value
        cent = p["cent"].value
        # note: 'amp' is never used below, so it cannot alter the fit
        return ((wofz((x - cent + wofz(gam).imag) / (sig * sqrt(2))).real)
                / (sig * sqrt(2))) - y

    mini = lmfit.Minimizer(residual, pars)
    result = mini.minimize()
    ci = lmfit.conf_interval(mini, result)
    lmfit.printfuncs.report_ci(ci)
You will get this error message if lmfit.minimize() (actually, leastsq(), which it calls) is unable to estimate uncertainties by inverting the curvature matrix. It uses these values (which are often very good estimates, BTW) as the scale for explicitly exploring parameter space. There are several possible reasons why leastsq() might fail to estimate uncertainties. Common reasons are that one or more of the variables is not found to alter the fit, or the residual contains NaNs.
It is hard to predict when this might happen. You should allow for the possibility and/or check that the initial fit succeeded and was able to make the initial estimate of the uncertainties (check result.errorbars) before calling conf_interval().
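A minimal sketch of that check, reusing the mini object from the question's code:
import lmfit

# 'mini' is the lmfit.Minimizer built in the question's code above
result = mini.minimize()
if result.errorbars:  # True only if leastsq produced uncertainty estimates
    ci = lmfit.conf_interval(mini, result)
    lmfit.printfuncs.report_ci(ci)
else:
    print("no uncertainty estimates; check for parameters that never affect "
          "the residual, or NaNs in the residual")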