I am comparing the results from arima_model and ar_model. Here is what I can't understand:
Why are the resulting coefficients different? Is it because of the estimation method? (Different settings of the method argument of fit() don't give identical results.)
After extracting the coefficients and backtesting the fitted values, I can match those of the AR(1) but not those of the ARIMA(1,0,0). Why?
What is ARIMA really doing in this simplest setting? Isn't it supposed to be able to reproduce AR?
import pandas_datareader as pdr
import datetime
aapl = pdr.get_data_yahoo('AAPL', start=datetime.datetime(2006,1,1), end=datetime.datetime(2020,6,30))
aapl = aapl.resample('M').mean()
aapl['close_pct_change'] = aapl['Close'].pct_change()
from statsmodels.tsa.arima_model import ARIMA
mod = ARIMA(aapl['close_pct_change'][1:], order=(1,0,0))
res1 = mod.fit(method='mle')
print(res1.summary())
from statsmodels.tsa.ar_model import AutoReg, ar_select_order
mod = AutoReg(aapl['close_pct_change'][1:], 1)
res2 = mod.fit()
print(res2.summary())
fitted_check1 = res1.params[0] + res1.params[1]*aapl['close_pct_change'][1:].shift(1)
print(fitted_check1[1:] - res1.fittedvalues)
fitted_check2 = res2.params[0] + res2.params[1]*aapl['close_pct_change'][1:].shift(1)
print(fitted_check2[1:] - res2.fittedvalues)
Why are the resulting coefficients different? Is it because of the estimation method? (Different settings of the method argument of fit() don't give identical results.)
AutoReg estimates parameters using OLS, which is conditional (on the first observation) maximum likelihood. ARIMA implements full maximum likelihood and so uses the information available in the first observation when estimating the parameters. In very large samples the coefficients should be very close, and they are equal in their asymptotic limit. In practice they will always differ, although the difference should usually be minor.
After extracting the coefficients and backtesting the fitted values, I can match those of the AR(1) but not those of the ARIMA(1,0,0). Why?
The two models use different representations. AutoReg(1)'s model is Y(t) = a + b Y(t-1) + eps(t). ARIMA(1,0,0) is specified as (Y(t) - c) = b * (Y(t-1) - c) + eps(t). If |b|<1, then in the large sample limit c = a / (1-b), although in finite samples this identity will not hold exactly.
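Using the fitted objects from the code above, a rough check of this relationship could look like the following (a sketch only, assuming the parameters are ordered constant first and AR(1) coefficient second, the same positional indexing the question itself uses):
a, b_ar = res2.params[0], res2.params[1]   # AutoReg: intercept a and slope b
c = res1.params[0]                         # ARIMA(1,0,0): const, i.e. the process mean
print(a / (1 - b_ar))                      # mean implied by the AutoReg fit
print(c)                                   # close, but not exactly equal in finite samples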
What is ARIMA really doing in this simplest setting? Isn't it supposed to be able to reproduce AR?
No. ARIMA uses the statsmodels Statespace framework which can estimate a wide range of models using Gaussian MLE.
ARIMA is essentially a special case of SARIMAX and this notebook provides a good introduction.
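For illustration only (not code from the question), the same AR(1)-with-constant model can be written directly against the state-space backend; note that the reported constant may be parameterized as an intercept rather than a process mean, so compare it with care:
from statsmodels.tsa.statespace.sarimax import SARIMAX
mod_ss = SARIMAX(aapl['close_pct_change'][1:], order=(1, 0, 0), trend='c')
res_ss = mod_ss.fit(disp=False)
print(res_ss.summary())   # Gaussian MLE via the state-space (SARIMAX) machinery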
So, I'm trying to develop an ML model for multiple linear regression that predicts Y given n X variables. So far, my model can read in a data set and give the predicted value, a coefficient of determination, and the respective coefficients for a 1-unit increase in each X. The only issues are:
I can't get the p-values for the life of me; most of the time it says the data isn't shaped right because it is 5 columns by 1329 rows, and when I do get an output, the values are just incorrect. I know because I ran the same regression with the Analysis ToolPak in Excel.
Is there a way to make the model recursive so that it finds the highest p-value above .05, calls itself again without that variable, and repeats until it hits the base case? Which would be something like
While dependent_v[pvalue] > .05:
Also what would be the best visualization method to show my data?
Thank you to any and all who help. I'm just starting to delve into machine learning on my own and want to hone my skills before an upcoming data science internship in the summer.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model
def multipleReg():
    dfreg = pd.read_csv("dfreg.csv")

    # Setting dependent variables
    dependent_v = ['Large_size', 'Mid_Level', 'Senior_Level', 'Exec_Level', 'Company_Location']
    # Setting independent variable
    independent_v = 'Salary_In_USD'

    X = dfreg[dependent_v]  # Drawing dependent variables from dataframe
    y = dfreg[independent_v]  # Drawing independent variable from dataframe

    reg = linear_model.LinearRegression()  # Initializing regression model
    reg.fit(X.values, y)  # Fitting appropriate values

    predicted_sal = reg.predict([[1, 0, 1, 0, 0]])  # Prediction from a 2-D array
    percent_rscore = reg.score(X.values, y) * 100  # Model coefficient of determination

    print('\n')
    print("The predicted salary is:", predicted_sal)
    print("The coefficient of determination is:", "{:,.2f}%".format(percent_rscore))
    # Printing coefficients of dependent variables (how much Y increases due to a
    # 1-unit increase in X)
    print("The corresponding coefficients for the dependent variables are:", reg.coef_)
As far as I know, sklearn doesn't return p-values; it's better to use the statsmodels library.
But if you need to use sklearn anyway, you can find various solutions here:
Find p-value (significance) in scikit-learn LinearRegression
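If you can switch, a minimal statsmodels sketch (reusing the column names from the question, which are assumptions about your dfreg.csv) looks like this:
import pandas as pd
import statsmodels.api as sm

dfreg = pd.read_csv("dfreg.csv")
predictors = ['Large_size', 'Mid_Level', 'Senior_Level', 'Exec_Level', 'Company_Location']
X = sm.add_constant(dfreg[predictors])   # add an intercept column
y = dfreg['Salary_In_USD']

ols_res = sm.OLS(y, X).fit()
print(ols_res.summary())                 # coefficients, R-squared, and p-values in one table
print(ols_res.pvalues)                   # just the p-values, indexed by variable name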
I have 15 data sets each of which I have fitted with a curve. Now I am trying to determine the quality of fit by doing a chi-squared test; however, when I run my code:
chi, p_value = stats.chisquare(n, y)
where n is the actual data and y is the predicted data, I get the error
For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
0.1350785306607008
I can't seem to understand why they have to add up to the same total - are there any ways I can run a chi-squared test without muddling my data?
This chi-squared test for goodness of fit indeed requires the sums of both inputs to be (almost) the same. So, if you want to check whether your model fits the observations n well, you have to adjust the counts y of your model as described e.g. here. This could be done with a small wrapper:
from scipy.stats import chisquare
import numpy as np
def cs(n, y):
    return chisquare(n, np.sum(n) / np.sum(y) * y)
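For example, with made-up observed and fitted counts whose totals do not match exactly:
n_obs = np.array([10, 12, 9, 14, 15])            # hypothetical observed counts
y_fit = np.array([11.0, 11.0, 10.0, 13.0, 16.5])  # hypothetical fitted counts (different sum)
chi, p_value = cs(n_obs, y_fit)
print(chi, p_value)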
Another possibility would be to go for R and use chisq.test.
I have built an XGBoostRegressor model using around 200 categorical features to predict a continuous time variable.
But I want to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I want both the prediction and P(Y|X) as output. Any idea how to do this?
There is no probability in regression. In regression the only output you get is a predicted value; that's why it is called regression, so for any regressor the probability of a prediction is not available. It only exists in classification.
As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note though, is that the variance might not be the same along the data.
Let's assume that you study a time-based phenomenon. Specifically, you have the temperature (y) inside an oven after x seconds. At x = 0 s it is at 20°C, you start heating it, and you want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or it could be completely different; the latter case is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you took care of heteroscedasticity, so your interval is the same for all the data.
You can probably try to get the distribution of your known outputs, place the prediction on that curve, and check the p-value. But that would only give you a measure of how plausible that output is, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
EDIT
This is how I would do it. Obviously the outputs are your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d
N = 1000 # The number of sample
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)
# We want to get a normed histogram (since this is PDF, if we integrate
# it must be equal to 1)
nbins = N / 10
n = int(N / nbins)
p, x = np.histogram(outputs, bins=n, density=True)
plt.hist(outputs, bins=n, density=True)
x = x[:-1] + (x[1] - x[0])/2  # converting bin edges to centers
# Now we want to interpolate :
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9*std, 2.9*std, 10000)
plt.plot(x, f(x))
plt.show()
# To check :
area = integrate.quad(f, x[0], x[-1])
print(area) # (should be close to 1)
Now, the interpolation method is not great for outliers: if a predicted value is extremely far (more than 3 times the std) from your distribution, it won't work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in that time. I'm sure there are some better ways to do it. If your data follow a normal law, it becomes trivial.
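For that normal case, a minimal sketch (the prediction value 0.7 below is just a hypothetical example):
from scipy import stats
import numpy as np

outputs = np.random.normal(loc=0, scale=1, size=1000)   # use your real outputs in practice
mu, sigma = outputs.mean(), outputs.std(ddof=1)          # fit a normal law to the outputs
density_at_pred = stats.norm(mu, sigma).pdf(0.7)         # density at a hypothetical prediction
print(density_at_pred)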
I suggest you look into NGBoost (essentially a wrapper around XGBoost that ultimately provides a probabilistic model).
Here you can find slides on how NGBoost works, as well as the seminal NGBoost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (by default the Gaussian distribution) and fit an XGBoost model to estimate the best parameters of the distribution (for the Gaussian, $\mu$ and $\sigma$). The model will split the variables' space into different regions with different distributions, i.e. the same family (e.g. Gaussian) but different parameters.
After training the model, you're provided with the method pred_dist, which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$.
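A minimal sketch of what that looks like with the ngboost package (the dataset and train/test split are placeholders, and the default Normal distribution is assumed):
from ngboost import NGBRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)                    # stand-in for your own features/target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ngb = NGBRegressor().fit(X_train, y_train)               # Gaussian distribution by default
point_pred = ngb.predict(X_test)                         # point prediction (the distribution mean)
y_dist = ngb.pred_dist(X_test)                           # estimated P(Y|X=x) for each row
print(y_dist.params['loc'][:5], y_dist.params['scale'][:5])  # per-row mu and sigma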
I am a heavy R user and am recently learning python.
I have a question about how statsmodels.api handles duplicated features.
In my understanding, this function is a Python version of glm in R. So I am expecting the function to return the maximum likelihood estimates (MLE).
My question is: which algorithm does statsmodels employ to obtain the MLE?
In particular, how does the algorithm handle the situation with duplicated features?
To clarify my question, I generate a sample of size 50 from a Bernoulli distribution with a single covariate x1.
import statsmodels.api as sm
import pandas as pd
import numpy as np
def ilogit(eta):
    return 1.0 - 1.0 / (np.exp(eta) + 1)
## generate samples
Nsample = 50
cov = {}
cov["x1"] = np.random.normal(0,1,Nsample)
cov = pd.DataFrame(cov)
true_value = 0.5
resp = {}
resp["FAIL"] = np.random.binomial(1, ilogit(true_value*cov["x1"]))
resp = pd.DataFrame(resp)
resp["NOFAIL"] = 1 - resp["FAIL"]
Then fit the logistic regression as:
## fit logistic regression
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
This returns:
The estimated coefficient is more or less similar to the true value (=0.5).
Then I create a duplicate column, namely x2, and fit the logistic regression model again. (glm in R would return NA for x2.)
cov["x2"] = cov["x1"]
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
This outputs:
Surprisingly, this works, and the coefficient estimates of x1 and x2 are exactly identical (=0.1182). Since the previous fit returned a coefficient estimate of x1 = 0.2364, the estimate was halved.
Then I increase the number of duplicated features to 9 and fit the model:
for icol in range(3, 10):
    cov["x" + str(icol)] = cov["x1"]
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
As expected, the estimates of each duplicated variable are the same (0.0263) and they seem to be 9 times smaller than the original estimate for x1 (0.2364).
I am surprised by this unexpected behaviour of the maximum likelihood estimates. Could you explain why this is happening and also what kind of algorithm is employed behind statsmodels.api?
The short answer:
GLM is using the Moore-Penrose generalized inverse, pinv, in this case, which corresponds to a principal component regression where components with zero eigenvalues are dropped. A zero eigenvalue is defined by the default threshold (rcond) in numpy.linalg.pinv.
statsmodels does not have a systematic policy towards collinearity. Some nonlinear optimization routines raise an exception when the matrix inverse fails. However, the linear regression models, OLS and WLS, use the generalized inverse by default, in which case we see the behavior as above.
The default optimization algorithm in GLM.fit is iteratively reweighted least squares irls which uses WLS and inherits the default behavior of WLS for singular design matrices.
The version in statsmodels master also has the option of using the standard scipy optimizers, where the behavior with respect to singular or near-singular design matrices will depend on the details of the optimization algorithm.
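As a small illustration of the mechanism (using a plain least-squares pseudo-inverse rather than GLM itself, so only a sketch of the behaviour described above): with two identical columns, the minimum-norm pinv solution splits the weight equally between them.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
y = 0.5 * x1 + rng.normal(size=50)

b_single = np.linalg.pinv(x1[:, None]) @ y             # one copy of the feature
b_dup = np.linalg.pinv(np.column_stack([x1, x1])) @ y  # two identical copies

print(b_single)   # single coefficient estimate
print(b_dup)      # two equal coefficients, each half of the single estimate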
The following code fits an oversimplified generalized linear model using statsmodels:
model = smf.glm('Y ~ 1', family=sm.families.NegativeBinomial(), data=df)
results = model.fit()
This gives the coefficient and a stderr:
           coef    stderr
Intercept  2.9471  0.120
Now I want to graphically compare the real distribution of the variable Y (histogram) with the distribution that comes from the model.
But I need two parameters r and p to evaluate the stats.nbinom(r,p) and plot it.
Is there a way to retrieve the parameters from the results of the fitting?
How can I plot the PMF?
Generalized linear models (GLM) in statsmodels currently do not estimate the extra parameter of the Negative Binomial distribution. The Negative Binomial belongs to the exponential family of distributions only for a fixed shape parameter.
However, statsmodels also has Negative Binomial as a maximum likelihood model in discrete_model, which estimates all parameters.
The parameterization of the Negative Binomial for count regression is in terms of the mean or expected value, which is different from the parameterization in scipy.stats.nbinom. Actually, there are two commonly used parameterizations for Negative Binomial count regression, usually called nb1 and nb2.
Here is a quickly written script that recovers the scipy.stats.nbinom parameters, n=size and p=prob, from the estimated parameters. Once you have the parameters for the scipy.stats distribution you can use all the available methods: rvs, pmf, and so on.
Something like this should be made available in statsmodels.
In a few example runs, I got results like this
data generating parameters 50 0.25
estimated params 51.7167511571 0.256814610633
estimated params 50.0985814878 0.249989725917
As an aside, because of the underlying exponential reparameterization, the scipy optimizers sometimes have problems converging. In those cases, either providing better starting values or using Nelder-Mead as the optimization method usually helps.
import numpy as np
from scipy import stats
import statsmodels.api as sm
# generate some data to check
nobs = 1000
n, p = 50, 0.25
dist0 = stats.nbinom(n, p)
y = dist0.rvs(size=nobs)
x = np.ones(nobs)
loglike_method = 'nb1' # or use 'nb2'
res = sm.NegativeBinomial(y, x, loglike_method=loglike_method).fit(start_params=[0.1, 0.1])
print(dist0.mean())
print(res.params)
mu = res.predict() # use this for mean if not constant
mu = np.exp(res.params[0]) # shortcut, we just regress on a constant
alpha = res.params[1]
if loglike_method == 'nb1':
Q = 1
elif loglike_method == 'nb2':
Q = 0
size = 1. / alpha * mu**Q
prob = size / (size + mu)
print('data generating parameters', n, p)
print('estimated params          ', size, prob)
#estimated distribution
dist_est = stats.nbinom(size, prob)
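To answer the plotting part of the question, a short sketch that overlays the estimated PMF on a histogram of the simulated counts (variable names follow the script above):
import matplotlib.pyplot as plt

k = np.arange(y.min(), y.max() + 1)
plt.hist(y, bins=np.arange(y.min() - 0.5, y.max() + 1.5), density=True,
         alpha=0.5, label='observed counts')
plt.plot(k, dist_est.pmf(k), 'o-', label='estimated nbinom PMF')
plt.legend()
plt.show()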
BTW: I ran into this before but didn't have time to look at it
https://github.com/statsmodels/statsmodels/issues/106