PyMC Robust Linear Regression with Measured Uncertainties - python

I use least squares regression of data with measured errors in both x and y and use the reduced chi-square (mean square weighted deviation: mswd) as a measure of the fit. However, some of the assumptions for using reduced chi-squared likely are not met and I'd like to move towards an mcmc/bayesian approach using PyMC. I've searched the web but can't exactly seem to find what I'm looking for, most examples assume the data uncertainty is Gaussian, but here I have measured uncertainties in x and y.
It seems like I should be able to do this in PyMC2 or PyMC3 with glm.
Here's a typical dataset plotted up:
And the data to go with it:
# Data in Columns, Observations in Rows
# Measured values x versus y,
# Measured standard deviations sx and sy.
x sx y sy
0.3779397 0.001889699 0.5130084 2.748546e-05
0.3659092 0.001829546 0.5129624 2.721838e-05
0.3430834 0.001715417 0.5129023 2.720073e-05
0.4121606 0.002060803 0.5130235 2.755231e-05
0.3075815 0.001537908 0.5128739 2.776967e-05
0.3794471 0.001897236 0.5129950 2.842079e-05
0.1447394 0.000723697 0.5126784 2.816200e-05
I'm looking for any examples and references where people have done this. Thanks in advance.

Related

Chi Squared Analysis on Data sets that don't have matching frequencies

I have 15 data sets each of which I have fitted with a curve. Now I am trying to determine the quality of fit by doing a chi-squared test; however, when I run my code:
chi, p_value = stats.chisquare(n, y)
where n is the actual data and y is the predicted data, I get the error
For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
0.1350785306607008
I can't seem to understand why they have to add up to the same total - are there any ways I can run a chi-squared test without muddling my data?
This chi-squared test for goodness of fit indeed requires the sums of both inputs to be (almost) the same. So, if you want to check whether your model fits the observations n well, you have to adjust the counts y of your model as described e.g. here. This could be done with a small wrapper:
from scipy.stats import chisquare
import numpy as np
def cs(n, y):
return chisquare(n, np.sum(n)/np.sum(y) * y)
Another possibility would be to go for R and use chisq.test.

Creating vector with intervals drawn from Poisson process

I'm looking for some advice on how to implement some statistical models in Python. I'm interested in constructing a sequence of z values (z_1,z_2,z_3,...,z_n) where the number of jumps in an interval (z_1,z_2] is distributed according to the Poisson distribution with parameter lambda(z_2-z_1)
and the numbers of random jumps over disjoint intervals are independent random variables. I want my piecewise constant plot to look something like the two images below, where the y axis is Y(z), where Y(z) consists of N(0,1) random variables in each interval say.
To construct the z data, what would be the best way to tackle this? I have tried sampling values via np.random.poisson and then taking a cumulative sum, but the values drawn are repeated for small intensity values. Please any help or thoughts would be really helpful. Thanks.
np.random.poisson is used to sample the count of events that occured in [z_i, z_j). if you want to sample the events as they occur, then you just want the exponential distribution. for example:
import numpy as np
n = 50
z = np.cumsum(np.random.exponential(1/n, size=n))
y = np.random.normal(size=n)
plotting these (using step in matplotlib) gives something similar to your plots:
note the 1/n sets a "lambda" so on average we expect n points within [0,1]. in this case we got slightly less so it overshoot. feel free to rescale if that's important to you

Fitting data from scatterplot

I have a Dataframe with two columns which I scatter plotted and got something like the following picture:
I would like to know if there is a way to find a distribution curve who best fits it, since the tutorials I've found focus in the distribution of one variable only (e.g. this case. I'm looking for something like this:
Does anyone have any directions or sample code for this case?
You can try fitting different degrees of polynomial using numpy.polyfit. It takes x, y and degree of fitting polynomial as inputs.
You can write a loop which iterates from 1 to say 5 for the degrees. Plot the f(x) using the coefficients which are returned by the function.
for d in degrees:
Fit using np.polyfit(x, y, d)
Get coefficients and optionally plot f(x) for degree d
Then find sum of squares (yi - f(xi))^2
Note that the sum of squares is just an indication of the error - in general it would go down as the degree increases but the plotting will kind of show you if you are overfitting to the data.
This is just one of the ways to go about solving the problem.

Is there a way to get the probability of a prediction using XGBoostRegressor?

I have built a XGBoostRegressor model using around 200 categorical features predicting a countinous time variable.
But I would want to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I both want and P(Y|X) as output. Any idea how to do this?
There is no probability in regression, In regression the only output you will get is a predicted value thats why it is called regression, so for any regressor probability of a prediction is not possible. Its only there in classification.
As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note though, is that the variance might not be the same along the data.
Let's assume that you study a time based phenomenon. Specifically, you have the temperature (y) after (x) time (in sec for instance) inside an oven. At x = 0s it is at 20°C, and you start heating it, and want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or be completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you took care of heteroscedasticity, so your interval is the same for all the data.
You can probably try to get the distribution of your known outputs and compare the prediction on that curve, and check the pvalue. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
EDIT
This is how I would do it. Obviously the outputs are your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d
N = 1000 # The number of sample
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)
# We want to get a normed histogram (since this is PDF, if we integrate
# it must be equal to 1)
nbins = N / 10
n = int(N / nbins)
p, x = np.histogram(outputs, bins=n, normed=True)
plt.hist(outputs, bins=n, normed=True)
x = x[:-1] + (x[ 1] - x[0])/2 # converting bin edges to centers
# Now we want to interpolate :
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9*std, 2.9*std, 10000)
plt.plot(x, f(x))
plt.show()
# To check :
area = integrate.quad(f, x[0], x[-1])
print(area) # (should be close to 1)
Now, the interpolate method is not great for outliers. if a predicted data is extremely far (more than 3 times the std) from your distribution, it wont work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in that time. I'm sure there are some better ways to do it. If your data follow a normal law, it becomes trivial.
I suggest you to look into Ngboost (essentially a wrapper of Xgboost which provides eventually a probabilistic model.
Here you can find slides on the Ngboost functioning and the seminal Ngboost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (by default is the Gaussian distribution) and fit an Xgboost model to estimate the best parameters of the distribution (for the Gaussian $\mu$ and $\sigma$. The model will split the variables' space into different regions with different distributions, i.e. same family (eg. Gaussian) but different parameters.
After training the model, you're provided with the method '''pred_dist''' which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$

Mathematical background of statsmodels wls_prediction_std

wls_prediction_std returns standard deviation and confidence interval of my fitted model data. I would need to know the way the confidence intervals are calculated from the covariance matrix. (I already tried to figure it out by looking at the source code but wasn't able to) I was hoping some of you guys could help me out by writing out the mathematical expression behind wls_prediction_std.
There should be a variation on this in any textbook, without the weights.
For OLS, Greene (5th edition, which I used) has
se = s^2 (1 + x (X'X)^{-1} x')
where s^2 is the estimate of the residual variance, x is vector or explanatory variables for which we want to predict and X are the explanatory variables used in the estimation.
This is the standard error for an observation, the second part alone is the standard error for the predicted mean y_predicted = x beta_estimated.
wls_prediction_std uses the variance of the parameter estimate directly.
Assuming x is fixed, then y_predicted is just a linear transformation of the random variable beta_estimated, so the variance of y_predicted is just
x Cov(beta_estimated) x'
To this we still need to add the estimate of the error variance.
As far as I remember, there are estimates that have better small sample properties.
I added the weights, but never managed to verify them, so the function has remained in the sandbox for years. (Stata doesn't return prediction errors with weights.)
Aside:
Using the covariance of the parameter estimate should also be correct if we use a sandwich robust covariance estimator, while Greene's formula above is only correct if we don't have any misspecified heteroscedasticity.
What wls_prediction_std doesn't take into account is that, if we have a model for the heteroscedasticity, then the error variance could also depend on the explanatory variables, i.e. on x.

Categories

Resources