Difference between predict and fittedvalues in statsmodels - Python

I have a very basic question, which I can somehow not find a real answer for.
Assuming I have a model:
import statsmodels.formula.api as smf
model = smf.ols(....).fit()
What is the difference between model.fittedvalues and model.predict?

model.predict is a method for predicting values, so you can provide it with an unseen dataset:
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100,2),columns=['X','Y'])
model = smf.ols('Y ~ X',data=df).fit()
model.predict(exog=pd.DataFrame({'X':[1,2,3]}))
If you do not provide the exog argument, it returns predictions for the data stored on the model object, as you can see in the source code:
def predict(self, params, exog=None):
    """
    Return linear predicted values from a design matrix.

    Parameters
    ----------
    params : array_like
        Parameters of a linear model.
    exog : array_like, optional
        Design / exogenous data. Model exog is used if None.

    Returns
    -------
    array_like
        An array of fitted values.

    Notes
    -----
    If the model has not yet been fit, params is not optional.
    """
    # JP: this does not look correct for GLMAR
    # SS: it needs its own predict method
    if exog is None:
        exog = self.exog
    return np.dot(exog, params)
On the other hand, model.fittedvalues is a property holding the stored in-sample fitted values. It will be exactly the same as model.predict() for the reasons explained above.
You can look at the other methods and attributes of the results object too.
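As a quick sanity check (a minimal sketch reusing the toy model fitted above), you can confirm that the stored fitted values and a no-argument predict() call coincide:
import numpy as np
# fittedvalues is an attribute holding the in-sample predictions;
# predict() without exog re-computes them from the stored design matrix
assert np.allclose(model.fittedvalues, model.predict())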

When calling smf.ols(....).fit(), you fit your model to the data. That is, for every data point in your data set, the model computes a fitted value that tries to explain it. At this point the model has only explained your historic data; it has not predicted anything yet. Also note that fittedvalues is a property (or attribute) of model.
model.predict() is a method of the model to actually predict unseen values.
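For instance (a small sketch reusing the toy df and model from the answer above), model.fittedvalues has one entry per historic row, while predict can be handed rows the model has never seen:
print(len(model.fittedvalues))                    # 100, one per row used in fitting
print(model.predict(pd.DataFrame({'X': [5.0]})))  # prediction for an unseen value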

Related

OLS fit for python with coefficient error and transformed target

There seem to be two methods for OLS fits in Python: the sklearn one and the statsmodels one. I have a preference for the statsmodels one because it gives the errors on the coefficients via the summary() function. However, I would like to use the TransformedTargetRegressor from sklearn to log-transform my target. It would seem that I need to choose between getting the errors on my fitted coefficients in statsmodels and being able to transform my target in sklearn. Is there a good way to do both of these at the same time in either system?
In statsmodels it would be done like this:
import statsmodels.api as sm
X = sm.add_constant(X)
ols = sm.OLS(y, X)
ols_result = ols.fit()
print(ols_result.summary())
which returns the fit summary with the coefficients and the standard errors on them.
For sklearn you can use the TransformedTargetRegressor:
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
regr = TransformedTargetRegressor(regressor=LinearRegression(),
                                  func=np.log1p, inverse_func=np.expm1)
regr.fit(X, y)
# TransformedTargetRegressor has no coef_ itself; the coefficients
# live on the fitted inner regressor
print('Coefficients: \n', regr.regressor_.coef_)
But there is no way to get the error on the coefficients without calculating them yourself. Is there a good way to get the best of both worlds?
EDIT
I found a good example for the special case I care about here
https://web.archive.org/web/20160322085813/http://www.ats.ucla.edu/stat/mult_pkg/faq/general/log_transformed_regression.htm
Just to add a lengthy comment here, I believe that TransformedTargetRegressor does not do what you think it does. As far as I can tell, the inverse transformation function is only applied when the predict method is called. It does not express the coefficients in units of the untransformed outcome.
Example:
import pandas as pd
import statsmodels.api as sm
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn import datasets
create some sample data:
df = pd.DataFrame(datasets.load_iris().data)
df.columns = datasets.load_iris().feature_names
X = df.loc[:,['sepal length (cm)', 'sepal width (cm)']]
y = df.loc[:, 'petal width (cm)']
Sklearn first:
regr = TransformedTargetRegressor(regressor=LinearRegression(),func=np.log1p, inverse_func=np.expm1)
regr.fit(X, y)
print(regr.regressor_.intercept_)
for coef in regr.regressor_.coef_:
    print(coef)
#-0.45867804195769357
# 0.3567583897503805
# -0.2962942997303887
Statsmodels on transformed outcome:
X = sm.add_constant(X)
ols_trans = sm.OLS(np.log1p(y), X).fit()
print(ols_trans.params)
#const -0.458678
#sepal length (cm) 0.356758
#sepal width (cm) -0.296294
#dtype: float64
You see that in both cases the coefficients are identical. That is, using the regression with TransformedTargetRegressor yields the same coefficients as statsmodels.OLS on the transformed outcome. TransformedTargetRegressor does not backtranslate the coefficients into the original untransformed space. Note that the coefficients would be non-linear in the original space unless the transformation itself is linear, in which case this is trivial (adding and multiplying with constants). This discussion points in a similar direction - backtransforming betas is infeasible in most/many cases.
What to do instead?
If interpretation is your goal, I believe the closest you can get to what you wish to achieve is to use predicted values in which you vary the regressors or the coefficients. So, let me give you an example: if your goal is to quantify the effect on the untransformed outcome of a sepal length coefficient one standard error higher, you can create the predicted values as fitted, as well as the predicted values for the 1-sigma scenario (either by tampering with the coefficient, or by tampering with the corresponding column in X).
Example:
# Toy example to add one sigma to sepal length coefficient
coeffs = ols_trans.params.copy()
coeffs['sepal length (cm)'] += 0.018 # this is one sigma
# function to predict and translate predictions back:
def get_predicted_backtransformed(coeffs, data, inv_func):
    return inv_func(data.dot(coeffs))
# get standard predicted values, backtransformed:
original = get_predicted_backtransformed(ols_trans.params, X, np.expm1)
# get counterfactual predicted values, backtransformed:
variant1 = get_predicted_backtransformed(coeffs, X, np.expm1)
Then you can report, for example, the mean shift in the untransformed outcome:
variant1.mean()-original.mean()
#0.2523083548367202
In short, scikit-learn does not calculate coefficient standard errors for you. However, if you opt to use it, you can calculate the errors yourself. In the question Python scikit learn Linear Model Parameter Standard Error, @grisaitis provided a great answer explaining the main concepts behind it.
If you just want a plug-and-play function that works with scikit-learn, you can use this:
def get_coef_std_errors(reg: 'sklearn.linear_model.LinearRegression',
                        y_true: 'np.ndarray', X: 'np.ndarray'):
    """Calculate the standard errors of the coefficients of a linear
    regression.

    Parameters
    ----------
    reg : sklearn.linear_model.LinearRegression
        LinearRegression object which has been fitted
    y_true : np.ndarray
        array containing the target variable
    X : np.ndarray
        array containing the features used in the regression

    Returns
    -------
    beta_std
        Standard errors of the regression coefficients
    """
    y_pred = reg.predict(X)        # get predictions
    errors = y_true - y_pred       # calculate residuals
    sigma_sq_hat = np.var(errors)  # estimate the residual variance
    sigma_beta_hat = sigma_sq_hat * np.linalg.inv(X.T @ X)
    return np.sqrt(np.diagonal(sigma_beta_hat))  # diagonal recovers the variances
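A quick usage sketch, reusing the X (with its constant column) and y from the statsmodels example above: the results should be close to statsmodels' bse, though not identical, because np.var divides by n while statsmodels divides by n - k:
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
reg = LinearRegression(fit_intercept=False).fit(X, y)  # constant is already a column of X
print(get_coef_std_errors(reg, y, X))  # manual standard errors
print(sm.OLS(y, X).fit().bse)          # statsmodels' standard errors, for comparison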

Getting completely different weight values when using sklearn.linear_model.SGDClassifier with different random_state value for Logistic Regression

I believe the weights should change only slightly with a different random state.
What could be the reason for getting different weights at every run with random_state = None?
Following are the weight values for a few runs (the data contains 3 features):
1)4.67100318,1.26129186,17.26554955
2)3.39793468,2.10265234,18.42484435
3)-2.08082186,1.25948975,10.37120852
4)3.71122156,0.93510126,16.63007864
Because of these fluctuations, I am not sure which random_state I should use, and this is creating trouble while performing feature selection.
Please note that I am using data after performing standardisation.
I am using very simple code, as below, to train my model, as my data contains only 200 rows with 3 features:
from sklearn.linear_model import SGDClassifier
SGDClf = SGDClassifier(loss='log',random_state=1)
SGDClf.fit(X,Y)
Machine learning models can produce different results on the same dataset. With random_state = None, a fresh seed is drawn on every run, so the sequence of random numbers used while fitting (for example, to shuffle the data or to split it into training, validation and test sets) differs from run to run.
Setting a model's seed to a fixed value, e.g. random_state = 1, ensures that the (weight) results are reproducible.
SGDClassifier() shuffles the input data:
The passed (random state) value will have an effect on the reproducibility of the
results returned by the function (fit, split, or any other function
like k_means). - random state doc
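To see this concretely, here is a minimal sketch (reusing the X and Y from the question): fitting twice with the same fixed seed gives identical weights, while random_state=None generally will not:
from sklearn.linear_model import SGDClassifier
import numpy as np
# same seed -> same shuffling -> identical weights on both runs
# (note: newer scikit-learn versions name this loss 'log_loss')
a = SGDClassifier(loss='log', random_state=1).fit(X, Y).coef_
b = SGDClassifier(loss='log', random_state=1).fit(X, Y).coef_
print(np.allclose(a, b))  # True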
Hope this helps.

StatsModels SARIMAX with exogenous variables - how to extract exogenous coefficients

I fit a statsmodels SARIMAX model to my data, leveraging some exogenous variables.
How do I extract the fitted regression parameters for the exogenous variables? The documentation makes it clear how to get the AR and MA coefficients, but says nothing about the exog coefficients. Any advice?
Code snippet below:
#imports
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
#X and Y variables, index as dates, X has several columns with exog variables
X = df[factors]
Y = df[target]
#lets fit it
model = SARIMAX(endog=Y[:'2020-04-13'], exog=X[:'2020-04-13'], order=(5,2,1))
#fit the model
model_fit = model.fit(disp=0)
#get AR coefficients
model_fit.polynomial_ar
There isn't a specific attribute for this, but you can always access all parameters using the model_fit.params attribute.
For the SARIMAX model, the exog parameters are always right after any trend parameters, so the following should always work:
exog_params = model_fit.params[model.k_trend:model.k_trend + model.k_exog]
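To make the mapping explicit, you can pair that slice with the exogenous variable names (a small sketch; it assumes model.exog_names holds the column names of X, which should be the case when exog is passed as a DataFrame):
exog_params = model_fit.params[model.k_trend:model.k_trend + model.k_exog]
# maps each exog column name to its fitted coefficient
print(dict(zip(model.exog_names, exog_params)))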

Get prediction and distance with Scikit KNeighborsClassifier

According to the docs, scikit-learn's KNeighborsClassifier offers these two methods to get predictions:
predict(X) : Returns class labels.
kneighbors(X) : Returns distances and indices of the nearest points in the training data.
I'm in need of a mix of both: getting the class label and the distance of that prediction. I'd like to avoid having to look up the training data when using the kneighbors method (which returns only the indices). Any way to do that?
After you get the indices from kneighbors(X), you can directly look up the class label for each of those indices like so:
class_label = clf.classes_[clf._y[index]]
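Putting the two together (a short sketch; X_test stands for your query points, and note that clf._y is a private, version-dependent attribute that maps internal sample indices to encoded labels):
distances, indices = clf.kneighbors(X_test)      # distances + indices of nearest training points
neighbor_labels = clf.classes_[clf._y[indices]]  # class label of each neighbor, no manual lookup
predictions = clf.predict(X_test)                # the usual (majority-vote) class labels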

model selection for GaussianMixture by using GridSearch

I'd like to use the GaussianMixture class from scikit-learn, and I have to perform model selection.
I want to do it by using GridSearchCV, and I would like to use for the selection the BIC and the AIC.
Both these criteria are implemented in GaussianMixture(), but I don't know how to use them in the definition of my custom scorer, since the function
make_scorer(score_func, greater_is_better=True, needs_proba=False, needs_threshold=False, **kwargs)
that I am using to create my custom scorer takes as input a function score_func, which has to be defined as
score_func(y, y_pred, **kwargs)
Can someone help me?
Using the BIC/AIC is an alternative to using cross validation. GridSearchCV selects models using cross validation. To perform model selection using the BIC/AIC we have to do something a little different. Let's take an example where we generate samples from two Gaussians, and then try to fit them using scikit-learn.
import numpy as np
X1 = np.random.multivariate_normal([0.,0.],[[1.,0.],[0.,1.]],10000)
X2 = np.random.multivariate_normal([10.,10.],[[1.,0.],[0.,1.]],10000)
X = np.vstack((X1,X2))
np.random.shuffle(X)
Method 1: Cross-validation
Cross validation involves splitting the data into pieces. One then fits the model on some of the pieces ('training') and tests how well it performs on the remaining pieces ('validating'). This guards against over-fitting. Here we will use two-fold cross validation, where we split the data in half.
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
#check 1->4 components
tuned_parameters = {'n_components': np.array([1,2,3,4])}
#construct grid search object that uses 2 fold cross validation
clf = GridSearchCV(GaussianMixture(),tuned_parameters,cv=2)
#fit the data
clf.fit(X)
#plot the number of Gaussians against their rank
plt.scatter(clf.cv_results_['param_n_components'],
            clf.cv_results_['rank_test_score'])
We can see that 2-fold cross validation favours two Gaussian components, as we expect.
Method 2: BIC/AIC
Instead of using cross-validation, we can evaluate the BIC of the best-fit model for each number of Gaussians, and then choose the model with the lowest BIC. The procedure would be identical with the AIC (although it is a different statistic and can give a different answer; the code structure would be identical to the one below).
bic = np.zeros(4)
n = np.arange(1,5)
models = []
#loop through each number of Gaussians and compute the BIC, and save the model
for i,j in enumerate(n):
    #create mixture model with j components
    gmm = GaussianMixture(n_components=j)
    #fit it to the data
    gmm.fit(X)
    #compute the BIC for this model
    bic[i] = gmm.bic(X)
    #add the best-fit model with j components to the list of models
    models.append(gmm)
After carrying out this procedure, we can plot the number of Gaussians against the BIC.
plt.plot(n,bic)
So we can see that the BIC is minimised for two Gaussians, so the best model according to this method also has two components.
Because I took 10000 samples from two very well-separated Gaussians (i.e. the distance between their centres is much larger than either of their dispersions), the answer was very clear-cut. This is not always the case, and often neither of these methods will confidently tell you which number of Gaussians to use, but rather some sensible range.
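Incidentally, if you would still like to stay inside GridSearchCV, its scoring argument also accepts a plain callable with signature (estimator, X, y), so the BIC can be plugged in directly. A sketch (the BIC is negated because GridSearchCV maximizes the score, and note the BIC is then evaluated on the held-out folds rather than on the full data):
def gmm_bic_score(estimator, X, y=None):
    # lower BIC is better, so return its negative for GridSearchCV to maximize
    return -estimator.bic(X)

clf_bic = GridSearchCV(GaussianMixture(), tuned_parameters,
                       scoring=gmm_bic_score, cv=2)
clf_bic.fit(X)
print(clf_bic.best_params_)  # should favour n_components=2 on this data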
