I fit a statsmodels SARIMAX model to my data, leveraging some exogenous variables.
How to extract the fitted regression parameters for the exogenous variables? It is clear per documentation how to get AR, MA coefficients, but nothing about exog coefficients. Any advice?
Code snippet below:
#imports
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
#X and Y variables, index as dates, X has several columns with exog variables
X = df[factors]
Y = df[target]
#lets fit it
model= SARIMAX(endog=Y[:'2020-04-13'], exog = X[:'2020-04-13'], order = (5,2,1))
#fit the model
model_fit = model.fit(disp=0)
#get AR coefficients
model_fit.polynomial_ar
There isn't a specific attribute for this, but you can always access all parameters using the model_fit.params attribute.
For the SARIMAX model, the exog parameters are always right after any trend parameters, so the following should always work:
exog_params = model_fit.params[model.k_trend:model.k_trend + model.k_exog]
Related
There seems to be two methods for OLS fits in python. The Sklearn one and the Statsmodel one. I have a preference for the statsmodel one because it gives the error on the coefficients via the summary() function. However, I would like to use the TransformedTargetRegressor from sklearn to log my target. It would seem that I need to choose between getting the error on my fit coefficients in statsmodel and being able to transform my target in statsmodel. Is there a good way to do both of these at the same time in either system?
In stats model it would be done like this
import statsmodels.api as sm
X = sm.add_constant(X)
ols = sm.OLS(y, X)
ols_result = ols.fit()
print(ols_result.summary())
To return the fit with the coefficients and the error on them
For Sklearn you can use the TransformedTargetRegressor
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
regr = TransformedTargetRegressor(regressor=LinearRegression(),func=np.log1p, inverse_func=np.expm1)
regr.fit(X, y)
print('Coefficients: \n', regr.coef_)
But there is no way to get the error on the coefficients without calculating them yourself. Is there a good way to get the best of both worlds?
EDIT
I found a good example for the special case I care about here
https://web.archive.org/web/20160322085813/http://www.ats.ucla.edu/stat/mult_pkg/faq/general/log_transformed_regression.htm
Just to add a lengthy comment here, I believe that TransformedTargetRegressor does not do what you think it does. As far as I can tell, the inverse transformation function is only applied when the predict method is called. It does not express the coefficients in units of the untransformed outcome.
Example:
import pandas as pd
import statsmodels.api as sm
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn import datasets
create some sample data:
df = pd.DataFrame(datasets.load_iris().data)
df.columns = datasets.load_iris().feature_names
X = df.loc[:,['sepal length (cm)', 'sepal width (cm)']]
y = df.loc[:, 'petal width (cm)']
Sklearn first:
regr = TransformedTargetRegressor(regressor=LinearRegression(),func=np.log1p, inverse_func=np.expm1)
regr.fit(X, y)
print(regr.regressor_.intercept_)
for coef in regr.regressor_.coef_:
print(coef)
#-0.45867804195769357
# 0.3567583897503805
# -0.2962942997303887
Statsmodels on transformed outcome:
X = sm.add_constant(X)
ols_trans = sm.OLS(np.log1p(y), X).fit()
print(ols_trans.params)
#const -0.458678
#sepal length (cm) 0.356758
#sepal width (cm) -0.296294
#dtype: float64
You see that in both cases, the coefficients are identical.That is, using the regression with TransformedTargetRegressor yields the same coefficients as statsmodels.OLS with the transformed outcome. TransformedTargetRegressor does not backtranslate the coefficients into the original untransformed space. Note that the coefficients would be non-linear in the original space unless the transformation itself is linear, in which case this is trivial (adding and multiplying with constants). This discussion here points into a similar direction - backtransforming betas is infeasible in most/many cases.
What to do instead?
If interpretation is your goal, I believe the closest you can get to what you wish to achieve is to use predicted values where you vary the regressors or the coefficients. So, let me give you an example: if your goal is to say what's the effect of one standard error higher for sepal length on the untransformed outcome, you can create the predicted values as fitted as well as the predicted values for the 1-sigma scenario (either by tempering with the coefficient, or by tempering with the corresponding column in X).
Example:
# Toy example to add one sigma to sepal length coefficient
coeffs = ols_trans.params.copy()
coeffs['sepal length (cm)'] += 0.018 # this is one sigma
# function to predict and translate predictions back:
def get_predicted_backtransformed(coeffs, data, inv_func):
return inv_func(data.dot(coeffs))
# get standard predicted values, backtransformed:
original = get_predicted_backtransformed(ols_trans.params, X, np.expm1)
# get counterfactual predicted values, backtransformed:
variant1 = get_predicted_backtransformed(coeffs, X, np.expm1)
Then you can say e.g. about the mean shift in the untransformed outcome:
variant1.mean()-original.mean()
#0.2523083548367202
In short, Scikit learn cannot help you in calculating coefficient standard errors. However, if you opt to use it, you can just calculate the errors by yourself. In the question Python scikit learn Linear Model Parameter Standard Error #grisaitis provided a great answer explaining the main concepts behind it.
If you only want to use a plug-and-play function that will work with sciait-learn you can use this:
def get_coef_std_errors(reg: 'sklearn.linear_model.LinearRegression',
y_true: 'np.ndarray', X: 'np.ndarray'):
"""Function that calculates the standard deviation of the coefficients of
a linear regression.
Parameters
----------
reg : sklearn.linear_model.LinearRegression
LinearRegression object which has been fitted
y_true : np.ndarray
array containing the target variable
X : np.ndarray
array containing the features used in the regression
Returns
-------
beta_std
Standard deviation of the regression coefficients
"""
y_pred = reg.predict(X) # get predictions
errors = y_true - y_pred # calculate residuals
sigma_sq_hat = np.var(errors) # calculate residuals std error
sigma_beta_hat = sigma_sq_hat * np.linalg.inv(X.T # X)
return np.sqrt(np.diagonal(sigma_beta_hat)) # diagonal to recover variances
I have a very basic question, which I can somehow not find a real answer for.
Assuming I have a model:
import statsmodels.formula.api as smf
model = smf.ols(....).fit()
What is the difference between model.fittedvalues and model.predict ?
model.predict is a method for predicting values, so you can provide it an unseen dataset:
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100,2),columns=['X','Y'])
model = smf.ols('Y ~ X',data=df).fit()
model.predict(exog=pd.DataFrame({'X':[1,2,3]}))
If you do not provide the exog argument, it returns the prediction by calling the data stored under the object, you see this under the source code:
def predict(self, params, exog=None):
"""
Return linear predicted values from a design matrix.
Parameters
----------
params : array_like
Parameters of a linear model.
exog : array_like, optional
Design / exogenous data. Model exog is used if None.
Returns
-------
array_like
An array of fitted values.
Notes
-----
If the model has not yet been fit, params is not optional.
"""
# JP: this does not look correct for GLMAR
# SS: it needs its own predict method
if exog is None:
exog = self.exog
return np.dot(exog, params)
On the other hand, model.fittedvalues is a property and it is the fitted values that are stored. It will be exactly the same as model.predict() for reasons explain above.
You can look at the methods for this type too.
When calling smf.ols(....).fit(), you fit your model to the data. I.e. for every data point in your data set, the model tries to explain it and computes a value for it. At this point, the model only tried to explain your historic data, without having predicted anything yet. Also note that fittedvalues is a property (or attribute) of model.
model.predict() is a method of the model to actually predict unseen values.
Here I asked how to compute AIC in a linear model. If I replace LinearRegression() method with linear_model.OLS method to have AIC, then how can I compute slope and intercept for the OLS linear model?
import statsmodels.formula.api as smf
regr = smf.OLS(y, X, hasconst=True).fit()
In your example, you can use the params attribute of regr, which will display the coefficients and intercept. They key is that you first need to add a column vector of 1.0s to your X data. Why? The intercept term is technically just the coefficient to a column vector of 1s. That is, the intercept is just a coefficient which, when multiplied by an X "term" of 1.0, produces itself. When you add this to the summed product of the other coefficients and features, to get your nx1 array of predicted values.
Below is an example.
# Pull some data to use in the regression
from pandas_datareader.data import DataReader
import statsmodels.api as sm
syms = {'TWEXBMTH' : 'usd',
'T10Y2YM' : 'term_spread',
'PCOPPUSDM' : 'copper'
}
data = (DataReader(syms.keys(), 'fred', start='2000-01-01')
.pct_change()
.dropna())
data = data.rename(columns = syms)
# Here's where we assign a column of 1.0s to the X data
# This is required by statsmodels
# You can check that the resulting coefficients are correct by exporting
# to Excel with data.to_clipboard() and running Data Analysis > Regression there
data = data.assign(intercept = 1.)
Now actually running the regression and getting coefficients takes just 1 line in addition to what you have now.
y = data.usd
X = data.loc[:, 'term_spread':]
regr = sm.OLS(y, X, hasconst=True).fit()
print(regr.params)
term_spread -0.00065
copper -0.09483
intercept 0.00105
dtype: float64
So regarding your question on AIC, you'll want to make sure the X data has a constant there as well, before you call .fit.
Note: when you call .fit, you create a regression results wrapper and can access any of the attributes lists here.
For anyone searching on how to get the slope and intercept of a LinearRegression in scikit-learn: it has coef_ and intercept_ properties which show this.
(x, y) = np.random.randn(10,2).T
from sklearn import linear_model
lr = linear_model.LinearRegression()
lr.fit(x.reshape(len(x), 1), y)
lr.coef_ # array([ 0.29387004])
lr.intercept_ # -0.17378418547919167
As we know, in logistic regression algorithm we predict one when theta times X is bigger than 0.5. I wanna raise the precision value. so i wanna change the predict function to predict 1 when theta times X is bigger than 0.7 or other values bigger than 0.5.
If i write the algorithm i could easily do it. But with sklearn package, i have no idea what to do.
Anyone can give me a hand?
To explain the question clearly enough, here is the predict function wroten in octave:
p = sigmoid(X*theta);
for i=1:size(p)(1)
if p(i) >= 0.6
p(i) = 1;
else
p(i) = 0;
endif;
endfor
The LogisticRegression predictor object from sklearn has a predict_proba method which outputs the probabilities that an input example belongs to a certain class. You can use this function along with your own defined theta times X to get the functionality you desire.
An example:
from sklearn import linear_model
import numpy as np
np.random.seed(1337) # Seed random for reproducibility
X = np.random.random((10, 5)) # Create sample data
Y = np.random.randint(2, size=10)
lr = linear_model.LogisticRegression().fit(X, Y)
prob_example_is_one = lr.predict_proba(X)[:, 1]
my_theta_times_X = 0.7 # Our custom threshold
predict_greater_than_theta = prob_example_is_one > my_theta_times_X
Here's the docstring for predict_proba:
Probability estimates.
The returned estimates for all classes are ordered by the
label of classes.
For a multi_class problem, if multi_class is set to be "multinomial"
the softmax function is used to find the predicted probability of
each class.
Else use a one-vs-rest approach, i.e calculate the probability
of each class assuming it to be positive using the logistic function.
and normalize these values across all the classes.
Parameters
----------
X : array-like, shape = [n_samples, n_features]
Returns
-------
T : array-like, shape = [n_samples, n_classes]
Returns the probability of the sample for each class in the model,
where classes are ordered as they are in ``self.classes_``.
this works for both binary and multi-class classification:
from sklearn.linear_model import LogisticRegression
import numpy as np
#X = some training data
#y = labels for training data
#X_test = some test data
clf = LogisticRegression()
clf.fit(X, y)
predictions = clf.predict_proba(X_test)
predictions = clf.classes_[np.argmax(predictions > threshold, axis=1)]
I am using dataset to see the relationship between salary and college GPA. I am using sklearn linear regression model. I think the coefficients should be intercept and the coff. value of corresponding feature. But the model is giving a single value.
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression
# Use only one feature : CollegeGPA
labour_data_gpa = labour_data[['collegeGPA']]
# salary as a dependent variable
labour_data_salary = labour_data[['Salary']]
# Split the data into training/testing sets
gpa_train, gpa_test, salary_train, salary_test = train_test_split(labour_data_gpa, labour_data_salary)
# Create linear regression object
regression = LinearRegression()
# Train the model using the training sets (first parameter is x )
regression.fit(gpa_train, salary_train)
#coefficients
regression.coef_
The output is : Out[12]: array([[ 3235.66359637]])
Try:
regression = LinearRegression(fit_intercept =True)
regression.fit(gpa_train, salary_train)
and the results will be in
regression.coef_
regression.intercept_
In order to get a better understanding of your linear regression, you maybe should consider another module, the following tutorial helps: http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html
salary_pred = regression.predict(gpa_test)
print salary_pred
print salary_test
I think salary_pred = regression.coef_*salary_test.
Have a try that printed salary_pred and salary_test via pyplot. Figure can explain every thing.
Here you are training your model on a single feature gpa and a target salary:
regression.fit(gpa_train, salary_train)
If you train your model on multiple features e.g. python_gpa and java_gpa (with the target as salary), then you would get two outputs signifying coefficients of the equation (for the linear regression model) and a single intercept.
It is equivalent to: ax + by + c = salary (where c is the intercept, a and b are the coefficients).