R Arima works but Python statsmodels SARIMAX throws invertibility error - python

I am comparing SARIMAX fitting results between R (3.3.1) forecast package (7.3) and Python's (3.5.2) statsmodels (0.8).
The R-code is:
library(forecast)
data("AirPassengers")
Arima(AirPassengers, order=c(2,1,1), seasonal=list(order=c(0,1,0),
period=12))$aic
[1] 1017.848
The Python code is:
from statsmodels.tsa.statespace import sarimax
import pandas as pd
AirlinePassengers =
pd.Series([112,118,132,129,121,135,148,148,136,119,104,118,115,126,
141,135,125,149,170,170,158,133,114,140,145,150,178,163,
172,178,199,199,184,162,146,166,171,180,193,181,183,218,
230,242,209,191,172,194,196,196,236,235,229,243,264,272,
237,211,180,201,204,188,235,227,234,264,302,293,259,229,
203,229,242,233,267,269,270,315,364,347,312,274,237,278,
284,277,317,313,318,374,413,405,355,306,271,306,315,301,
356,348,355,422,465,467,404,347,305,336,340,318,362,348,
363,435,491,505,404,359,310,337,360,342,406,396,420,472,
548,559,463,407,362,405,417,391,419,461,472,535,622,606,
508,461,390,432])
AirlinePassengers.index = pd.DatetimeIndex(end='1960-12-31',
periods=len(AirlinePassengers), freq='1M')
print(sarimax.SARIMAX(AirlinePassengers,order=(2,1,1),
seasonal_order=(0,1,0,12)).fit().aic)
Which throws an error: ValueError: Non-stationary starting autoregressive parameters found with enforce_stationarity set to True.
If I set enforce_stationarity (and enforce_invertibility, which is also required) to False, the model fit works but AIC is very poor (>1400).
Using some other model parameters for the same data, e.g., ARIMA(0,1,1)(0,0,1)[12] I can get identical results from R and Python with stationarity and invertibility checks enabled in Python.
My main question is: What explains the difference in behavior with some model parameters? Are statsmodels' invertibility checks different from forecast's Arima and is the other somehow "more correct"?
I also found a pull request related to fixing an invertibility calculation bug in statsmodels: https://github.com/statsmodels/statsmodels/pull/3506
After re-installing statsmodels with the latest source code from Github, I still get the same error with the code above, but setting enforce_stationarity=False and enforce_invertibility=False I get aic of around 1010 which is lower than in the R case. But model parameters are also vastly different.

Related

Scikit-learn QuantileRegressor memory allocation error. No issue with statsmodel QuantReg with the same data

I'm trying to fit a quantile regression model to my input data. I would like to use sklearn, but I am getting a memory allocation error when I try to fit the model. The same data with the statsmodels equivalent function is working fine.
There error I get is the following:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 55.9 GiB for an array with shape (86636, 86636) and data type float64
It doesn't make any sense, my X and y are shapes (86636, 4) and (86636, 1) respectively.
Here's my script:
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import QuantileRegressor
training_df = pd.read_csv("/path/to/training_df.csv") # 86,000 rows
FEATURES = [
"feature_1",
"feature_2",
"feature_3",
"feature_4",
]
TARGET = "target"
# STATSMODELS WORKS FINE WITH 86,000, RUNS IN 2-3 SECONDS.
model_statsmodels = sm.QuantReg(training_df[TARGET], training_df[FEATURES]).fit(q=0.5)
# SKLEARN GIVES A MEMORY ALLOCATION ERROR, OR TAKES MINUTES TO RUN IF I SIGNIFICANTLY TRIM THE DATA TO < 1000 ROWS.
model_sklearn = QuantileRegressor(quantile=0.5, alpha=0)
model_sklearn.fit(training_df[FEATURES], training_df[TARGET])
I've checked the sklearn documentation and pretty sure my inputs are fine as dataframes, I get the same issues with NDarrays. So not sure what the issue is. Is it possible there's an issue with something under-the-hood?
[Here][1] is the scikit-learn documentation for QunatileRegressor.
Many thanks for any help / ideas.
[1]: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.QuantileRegressor.html
0
The sklearn QuantileRegressor class uses linear programming to solve the quantile regression problem which is much more computationally expensive than iterative reweighted least squares as used by statsmodel QuantReg class.
Here is a github issue for the same problem: https://github.com/scikit-learn/scikit-learn/issues/22922

Loading saved params (a Pandas series) into a Statsmodels state-space model

I'm building a dynamic factor model using the excellent python package statsmodels, and I would like to pickle an estimated parameter vector so I can build the model again later, and load those params into it. (C.f., this Notebook built by Chad Fulton: https://github.com/ChadFulton/tsa-notebooks/blob/master/dfm_coincident.ipynb.)
In the following block of code, initial parameters are estimated with mod.fit() (using the Powell algo) and then given back to mod.fit() to complete the estimation (using the EM algo) using the initial parameters as initial_res.params. (The latter is a Pandas Series.)
mod = sm.tsa.DynamicFactor(endog, k_factors=1, factor_order=2, error_order=2)
initial_res = mod.fit(method='powell', disp=False)
res = mod.fit(initial_res.params)
I would like to pickle res.params (again, a small Pandas Series, and a small disk footprint). Then later build the model from scratch again, and load my saved parameters into it without having to re-estimate the model. Anyone know how that can be done?
Examples I have seen suggest pickling the results object res, but that can be a pretty big save. Building it from scratch is pretty simple, but estimation takes a while. It may be that estimation starting from the saved optimal params is quicker; but still, that's pretty amateurish, right?
TIA,
Drew
You can use the smooth method on any state space model to construct a results object from specific parameters. In your example:
mod = sm.tsa.DynamicFactor(endog, k_factors=1, factor_order=2, error_order=2)
initial_res = mod.fit(method='powell', disp=False)
res = mod.fit(initial_res.params)
res.params.to_csv(...)
# ...later...
params = pd.read_csv(...)
res = mod.smooth(params)

IV2SLS output summary much different than previous OLS estimation

I ran a different version of Mincer´s equation to estimate salary. Firstly, I ran an OLS version without considering endogeneity and the results are the following:Output summary
Salario is actually the natural log of it.
After that, I wrote next code to get an estimation with 2SLS method to solve endogeneity in variables No_Feliz and Años_Edu using No_Dep and max_edu_padres as instrumental variables for each one. However, the output is a bit confusing and I don´t know how to deal with it.
from statsmodels.sandbox.regression.gmm import IV2SLS
resultIV = IV2SLS(_dfb['Salario'], _dfb[['No_Feliz','Años_Edu']], _dfb[['No_Dep','Estatura','Años_Expe','Edad','Edad_2','Peso','Peso_2','DI','D_Hombre']]).fit()
resultIV.summary()
Results: Output summary IV2SLS
It´s clear the output has some issues (R2 too high compared to ols result, no result for F-statistic, coef of No_Feliz has a positive sign while it's negative in OLS estimation, the other exogenous variables are not taken into account despite the fact I included them)
I´d appreciate if someone could help me to fix it or at least make things a bit more clear to me. Thank you very much!
Don't use statsmodels's sandbox version which is not well tested. Instead use linearmodels. It is fully tested and correct.
Start by installing using pip install linearmodels
Then you need to be explicit about which variables are endogenous and need instrumenting, which are exogenous so in the model but not instrumented, and which are instruments only.
from linearmodels import IV2SLS
dep = _dfb['Salario']
exog = None
endog = _dfb[['No_Feliz','Años_Edu']]
instr = _dfb[['No_Dep','Estatura','Años_Expe','Edad','Edad_2','Peso','Peso_2','DI','D_Hombre']]
resultIV = IV2SLS(dep, exog, endog, instr).fit()
resultIV.summary # Note: this is a property so no ()
You can find the help which has more details. There are also some examples.

Python ARIMA predictions returning NaNs

I am trying a simple prediction using ARIMA. The code below produces all NaNs as output for the argument of order of (1,1,3), but for order argument of (1,1,2) and (1,1,4) i am able to get proper(numerical) output. The same function works fine, in some other installations with older/newer versions of pandas, statsmodels and pmdarima. I checked related questions here in Stackoverflow, but since the same function with the same argument is working in other libraries, i assume there is nothing wrong with the order argument of (1,1,3) and probably the bug is with library versions or some other configuration. Any help is appreciated.
from statsmodels.tsa.arima_model import ARIMA
def testarima():
trainseries = pd.Series([600.00,10.00,405.00,900.00,500.00,500.00,500.00,500.00,500.00,
500.00,1000.00,533.00,2784.11,1775.00,300.00,4289.42,1270.00,
500.00,2145.00,1650.00,1750.00,785.00,4137.50,2450.00,2194.00,
1750.00,1000.00,2250.00,1000.00,1055.98,1000.00,250.00,450.00,
540.00,2247.50,200.00,820.00,570.00,555.00])
model = ARIMA(trainseries, order=(1, 1, 3))
# print("train: " + str(train))
try:
model_fit = model.fit(disp=0)
fc, se, conf = model_fit.forecast(24, alpha=0.05)
print('result: '+str(fc))
return fc
except:
return np.zeros(24)
statsmodels v 0.10.2
pmdarima v 1.5.1
pandas 0.25.3
python 3.7.5
There is a warning output
C:\Users\AppData\Local\Programs\Python\Python37\lib\site-packages\statsmodels\tsa\kalmanf\kalmanfilter.py:225: RuntimeWarning: invalid value encountered in log
Z_mat.astype(complex), R_mat, T_mat)
C:\Users\AppData\Local\Programs\Python\Python37\lib\site-packages\statsmodels\tsa\kalmanf\kalmanfilter.py:225: RuntimeWarning: invalid value encountered in true_divide
Z_mat.astype(complex), R_mat, T_mat)
C:\Users\AppData\Local\Programs\Python\Python37\lib\site-packages\statsmodels\base\model.py:492: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
'available', HessianInversionWarning)
But in other installation where this is working also these warnings appear, but there i get proper numerical output
use panda version 0.22.0.0
its not resolved then avoid statsmodel just import ARIMA, if you using tensorflow in colab its not supporting tensorflow 2 version

Python statsmodels trouble getting fitted model parameters

I'm using an AR model to fit my data and I think that I have done that successfully, but now I want to actually see what the fitted model parameters are and I am running into some trouble. Here is my code
model=ar.AR(df['price'],freq='M')
ar_res=model.fit(maxlags=50,ic='bic')
which runs without any error. However when I try to print the model parameters with the following code
print ar_res.params
I get the error
AssertionError: Index length did not match values
I am unable to reproduce this with current master.
import statsmodels.api as sm
from pandas.util import testing
df = testing.makeTimeDataFrame()
mod = sm.tsa.AR(df['A'])
res = mod.fit(maxlags=10, ic='bic')
res.params

Categories

Resources