Python statsmodels: trouble getting fitted model parameters

I'm using an AR model to fit my data, and I think the fit itself succeeded, but now I want to actually see the fitted model parameters and I am running into some trouble. Here is my code:
model = ar.AR(df['price'], freq='M')
ar_res = model.fit(maxlags=50, ic='bic')
which runs without any error. However, when I try to print the model parameters with the following code
print(ar_res.params)
I get the error
AssertionError: Index length did not match values

I am unable to reproduce this with current master.
import statsmodels.api as sm
from pandas.util import testing
df = testing.makeTimeDataFrame()
mod = sm.tsa.AR(df['A'])
res = mod.fit(maxlags=10, ic='bic')
res.params
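If you are hitting this on an older install, it may come from a statsmodels/pandas version combination that has since been fixed; since the error could not be reproduced on current master, checking the installed versions is a quick first step (a minimal sketch; the upgrade suggestion is an assumption, not a confirmed fix):

import statsmodels
import pandas

# the AssertionError comes from building a params Series whose index length
# disagrees with its values; if these versions are old, upgrading is worth a try:
#   pip install --upgrade statsmodels pandas
print(statsmodels.__version__)
print(pandas.__version__)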

Related

Scikit-learn QuantileRegressor memory allocation error. No issue with statsmodels QuantReg with the same data

I'm trying to fit a quantile regression model to my input data. I would like to use sklearn, but I am getting a memory allocation error when I try to fit the model. The same data with the statsmodels equivalent function is working fine.
The error I get is the following:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 55.9 GiB for an array with shape (86636, 86636) and data type float64
It doesn't make any sense: my X and y have shapes (86636, 4) and (86636, 1) respectively.
Here's my script:
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import QuantileRegressor
training_df = pd.read_csv("/path/to/training_df.csv") # 86,000 rows
FEATURES = [
"feature_1",
"feature_2",
"feature_3",
"feature_4",
]
TARGET = "target"
# STATSMODELS WORKS FINE WITH 86,000, RUNS IN 2-3 SECONDS.
model_statsmodels = sm.QuantReg(training_df[TARGET], training_df[FEATURES]).fit(q=0.5)
# SKLEARN GIVES A MEMORY ALLOCATION ERROR, OR TAKES MINUTES TO RUN IF I SIGNIFICANTLY TRIM THE DATA TO < 1000 ROWS.
model_sklearn = QuantileRegressor(quantile=0.5, alpha=0)
model_sklearn.fit(training_df[FEATURES], training_df[TARGET])
I've checked the sklearn documentation and I'm pretty sure my inputs are fine as DataFrames; I get the same issue with ndarrays, so I'm not sure what the problem is. Is it possible there's an issue with something under the hood?
[Here][1] is the scikit-learn documentation for QuantileRegressor.
Many thanks for any help / ideas.
[1]: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.QuantileRegressor.html
The sklearn QuantileRegressor class solves the quantile regression problem via linear programming, which is much more computationally expensive than the iteratively reweighted least squares used by the statsmodels QuantReg class. With the interior-point solver, the LP can materialize an n_samples x n_samples matrix, which is presumably where the (86636, 86636) allocation comes from.
Here is a github issue for the same problem: https://github.com/scikit-learn/scikit-learn/issues/22922
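One workaround discussed in that issue is to pick a cheaper LP backend via the solver parameter. A minimal sketch (solver='highs' requires scipy >= 1.6; training_df, FEATURES and TARGET as in the question):

from sklearn.linear_model import QuantileRegressor

# 'highs' delegates to scipy's HiGHS linear-programming backend, which is far
# faster and leaner than the legacy interior-point solver (newer scikit-learn
# releases use it by default)
model_sklearn = QuantileRegressor(quantile=0.5, alpha=0, solver="highs")
model_sklearn.fit(training_df[FEATURES], training_df[TARGET])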

statsmodels summary_col getting a LaTeX key error?

I've been getting a key error when using the summary_col function.
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
Y = [0,1,0,0,1,1,1]
X = [5,10,15,20,25,2,7]
logit = sm.Logit(Y,X)
fit = logit.fit()
print(fit.summary())
logit_output = summary_col([fit], stars=True)
print(logit_output.as_latex())
gets me a KeyError: '\m'. Surprisingly, fit.summary().as_latex() does not return this error.
I was reading the code a bit, and I think you are triggering a bug in statsmodels.
Here is my tentative explanation. The function summary_col returns an object of the Summary class and sets _merge_latex = True. In .as_latex(), the following if-clause is then triggered (this is the relevant code from the statsmodels source):
if self._merge_latex:
    # create single tabular object for summary_col
    tab = re.sub(to_replace, r'\\midrule\n', tab)
If you call fit.summary().as_latex() then _merge_latex = False by default. So you don't get into this part and don't get the same error.
Right now I am not sure exactly what is wrong. I can think of two cases:
re.sub() is only called once, and there is a leftover of the text you want to replace
r'\\midrule\n' is wrong in this line and it should be '\\midrule\n' instead (see the snippet just below)
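To see the difference concretely, here is a standalone demonstration (plain re, not statsmodels code) of how re.sub treats the two replacement strings. It shows where a stray '\m' can appear, though it does not by itself explain the KeyError:

import re

tab = 'AAA\nBBB'
# raw string: re.sub collapses the doubled backslash to a single literal
# backslash, so the replacement becomes '\midrule' plus a newline
print(re.sub('AAA', r'\\midrule\n', tab))
# plain string: the replacement then starts with the two characters '\' 'm',
# which recent Python versions reject as a bad escape:
# re.sub('AAA', '\\midrule\n', tab)   # re.error: bad escape \m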
To make real progress, one would have to build a minimal example.
To check whether I am on the right track, disable this branch by adding
logit_output._merge_latex = False
before you rerun
print(logit_output.as_latex())
and see whether the error changes. Note this may generate output you don't actually want.

Process finished with exit code -1073740940 (0xc0000374) using Scikit-learn KernelPCA

I tried to perform dimensionality reduction on my n_samples x 53 data using scikit-learn's Kernel PCA with a precomputed kernel. The code worked without any issues when I first tried it with 50 samples. However, when I increased the number of samples to 100, I suddenly got the following message.
Process finished with exit code -1073740940 (0xC0000374)
Here's the detail of what I want to do:
I want to obtain the optimum value of the kernel function's hyperparameter in my Kernel PCA function, defined as follows.
from sklearn.decomposition import KernelPCA as drm
from somewhere import costfunction
from somewhere_else import customkernel

def kpcafun(w, X):
    # X is the sample matrix, w is the kernel hyperparameter
    n_princomp = 2
    drmodel = drm(n_princomp, kernel='precomputed')
    k_matrix = customkernel(X, X, w)
    transformed_x = drmodel.fit_transform(k_matrix)
    cost = costfunction(transformed_x)
    return cost
Therefore, to optimize the hyperparams I used the following code.
from scipy.optimize import minimize
# assume that wstart and optimbound are already defined
res = minimize(kpcafun, wstart, method='L-BFGS-B', bounds=optimbound, args=(X,))  # args must be a tuple, hence (X,)
The strange thing is that when I stepped through the first 10 iterations of the optimization in the debugger, nothing strange happened; all variable values looked normal. But when I turned off the breakpoints and let the program continue, the message appeared without any Python traceback.
Does anyone know what might be wrong with my code? Or does anyone have tips for resolving a problem like this?
Thanks

Find the sum of the residuals

I am doing a hands-on exercise on Poisson regression from Stats with Python on Fresco Play.
The problem statement is:
Load the R dataset Insurance from the MASS package.
Capture the data as a pandas dataframe.
Build a Poisson regression model with a log of an independent variable
Holders, and dependent variable Claims.
Fit the model with data, and find the sum of the residuals.
I am stuck on the last step, i.e. the sum of the residuals.
I used np.sum(model.resid), but the answer is not accepted.
Here is my code:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
INS_data = sm.datasets.get_rdataset('Insurance','MASS').data
model = smf.poisson('Claims ~ np.log(Holders)', INS_data).fit()
print(np.sum(model.resid))
I was running the code in Python 2, which gave the wrong answer, but running it in Python 3 gave the correct answer. I don't know the reason, but the code works perfectly in Python 3.
For the residuals, you can use the basic definition of a residual, i.e. actual minus predicted.
Here is the code snippet.
import statsmodels.api as sm
import numpy as np
import statsmodels.formula.api as smf
Insurance = sm.datasets.get_rdataset('Insurance','MASS')
data = Insurance.data
data['Holders_'] = np.log(data['Holders'])
model = smf.poisson('Claims ~ Holders_', data).fit()
y_predicted = model.predict(data)
residual = (data['Claims']-y_predicted)
print(sum(residual))
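As a sanity check on the definition above: for statsmodels' discrete Poisson results, resid is documented as the response residual (actual minus predicted mean), so the manual computation should agree with np.sum(model.resid) from the question (assuming model and data from the snippet above; the equivalence is my reading of the docs, not something verified in this thread):

import numpy as np

manual = data['Claims'] - model.predict(data)
print(np.allclose(manual, model.resid))   # expected: True
print(manual.sum(), np.sum(model.resid))  # the two sums should match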
After much search, I came to know that it expects the cumulative sum, so use
np.cumsum(model.resid)
It will pass in Fresco Play.

anova_lm() python: on which model type does it work?

I am new to Python and am trying to transition to a single platform, Python, from Matlab and R.
I need to do a regression of my data.
After reading what is available online (unfortunately not yet as plentiful as for R), I realized that I need to play with the following options:
import statsmodels.api as sm
import statsmodels.formula.api as smf
mod1 = smf.glm(formula=formula_new, data=dta_new, family=sm.families.Gaussian())
mod2 = smf.ols(formula=formula_new, data=dta_new, family=sm.families.Gaussian())
mod3 = sm.OLS.from_formula(formula=formula_new, data=dta_new)
All three give me similar results.
What I really want to know is whether there exists a function similar to anova() from R (with a nice table summarizing the comparison of different models, or of different variables within a model, as shown here: http://www.r-bloggers.com/r-tutorial-series-anova-tables/) for any of these model options.
I tried to run
table = sm.stats.anova_lm(modX)
print(table)
with X = 1, 2, 3, basically for all models (those coming from smf. or sm.), but I always get the same error:
AttributeError: 'OLS'/'GLM' object has no attribute 'model'
with OLS or GLM depending on the type of model.
Thanks for any input. Am I not importing the modules correctly? I am confused.
Links to Python applications/examples/tutorials are welcome.
rpy2 is not an option on my server; I am working on getting R 3.0 installed, but it might take a while.
I figured out why it wasn't working on all models.
anova_lm() wants the fitted results object returned by fit():
table = sm.stats.anova_lm(modX.fit())
print(table)
However, it works only with mod2 and mod3; it does not work with GLM models.
Here is some info I found online relevant to this issue. Hopefully anova_lm will be extended to GLM models soon.
http://comments.gmane.org/gmane.comp.python.pystatsmodels/11000
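For completeness: anova_lm can also reproduce R's anova(fit1, fit2) model-comparison table when you pass several fitted OLS results. A minimal self-contained sketch with made-up data (the column names and formulas are placeholders, not from the question):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# toy data standing in for dta_new
rng = np.random.default_rng(0)
df = pd.DataFrame({'y': rng.normal(size=100),
                   'x1': rng.normal(size=100),
                   'x2': rng.normal(size=100)})

res_small = smf.ols('y ~ x1', data=df).fit()
res_big = smf.ols('y ~ x1 + x2', data=df).fit()

# single-model ANOVA table (Type I by default), like R's anova(fit)
print(sm.stats.anova_lm(res_small))
# nested-model comparison, like R's anova(fit1, fit2)
print(sm.stats.anova_lm(res_small, res_big))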
