I am very new to time series modeling and statsmodels, and I am trying to understand the AR model in statsmodels. Suppose I have a data record y of 1000 samples, and I fit an AR(1) model on y. Then I generate the in-sample predictions from this model as y_pred. I do this as:
from statsmodels.tsa.ar_model import AutoReg
model = AutoReg(y,1).fit()
y_pred = model.predict()
I get the parameters of the model using model.params.
I would like to know, after estimating the model parameters, how does statsmodels calculate the in-sample predictions? For example, how is y_pred[10] calculated?
I am sorry if the question is too basic, thanks for the help.
Per Wikipedia:
The autoregressive model specifies that the output variable depends linearly on its own previous values and on a stochastic term (an imperfectly predictable term).
In your model example, you have one predictor: the lagged value of y. In this simple case, the .predict() method multiplies each lagged value by the estimated slope parameter and adds the estimated intercept. So the in-sample prediction for y[10] is the fitted slope times y[9] plus the fitted intercept (note that, with a plain array, the returned predictions start at the second observation, so this value sits at position 9 of y_pred).
Here is an example:
from statsmodels.tsa.ar_model import AutoReg
y = [1, 2, 3, 6, 2, 9, 1]
model = AutoReg(y,1).fit()
model.params
# array([ 5.72953737, -0.49466192])
The first value in the params array is the estimated intercept parameter and the second value is the estimated linear (slope) parameter.
y_pred = model.predict()
y_pred
# array([5.23487544, 4.74021352, 4.2455516 , 2.76156584, 4.74021352, 1.27758007])
The first value in the y_pred array is the predicted value for the second value in the y array. It is calculated as:
-0.49466192 * 1 + 5.72953737 = 5.23487544
The second value in the y_pred array is computed as:
-0.49466192 * 2 + 5.72953737 = 4.74021353
and so on...
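To check this yourself on a longer series (e.g. the 1000-sample record from the question), here is a small sketch; the random series is just made-up data:
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

y = np.random.randn(1000).cumsum()      # any 1000-sample series
res = AutoReg(y, lags=1).fit()
y_pred = res.predict()                  # in-sample predictions, starting at the second observation

intercept, slope = res.params
# The prediction for y[10] is built from y[9]; with the default start it
# sits at position 9 of the returned array.
print(np.isclose(intercept + slope * y[9], y_pred[9]))   # True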
I'm fitting some data for a classification task using Gaussian Process Classifiers in sklearn. I know that for the Gaussian Process Regressor one can pass return_std in
y_test, std = gp.predict(x_test, return_std=True)
to output the standard deviation of the test sample (like in this question)
However, I couldn't find such a parameter for the GP Classifier.
Is there such a thing as outputting the predictive mean and standard deviation of test data from a GP classifier? And is there a way to output the posterior mean and covariance of the fitted model?
There is no standard deviation for categorical data, hence there is no return_std parameter in the classifier.
However, if you want to quantify the uncertainty of the classifier predictions, you can use the .predict_proba(X) method. Once you have the probability of each possible class, you can compute the entropy of the predicted probabilities.
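For example, a minimal sketch (with made-up toy data) that turns the predicted probabilities into an entropy-based uncertainty score could look like this:
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

X = np.random.rand(100, 2)                      # toy features
y = (X[:, 0] + X[:, 1] > 1).astype(int)         # toy labels

clf = GaussianProcessClassifier().fit(X, y)
proba = clf.predict_proba(X[:5])                # shape (5, n_classes)
entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
print(entropy)                                  # larger values = more uncertain predictions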
You can get the variance associated with the logit (latent) function by going to the predict_proba function definition in _gpc.py and returning the var_f_star value. Below, I have adapted predict_proba into a function that returns the logit variance:
def predict_var(self, X):
    """Return the variance of the latent (logit) function at X.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features) or list of object
        Query points where the GP is evaluated.

    Returns
    -------
    var_f_star : array of shape (n_samples,)
        Variance of the latent function f* at the query points.
    """
    check_is_fitted(self)
    # Based on Algorithm 3.2 of GPML
    K_star = self.kernel_(self.X_train_, X)  # K_star = k(x_star)
    f_star = K_star.T.dot(self.y_train_ - self.pi_)  # Line 4 (posterior mean, unused here)
    v = solve(self.L_, self.W_sr_[:, np.newaxis] * K_star)  # Line 5
    # Line 6 (compute np.diag(v.T.dot(v)) via einsum)
    var_f_star = self.kernel_.diag(X) - np.einsum("ij,ij->j", v, v)
    return var_f_star
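As a rough usage sketch (this leans on private scikit-learn internals and assumes a binary problem, so treat it as an illustration rather than a supported API), you could bind the function above to the fitted binary estimator instead of editing _gpc.py:
import types
import numpy as np
from scipy.linalg import solve
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.utils.validation import check_is_fitted

X = np.random.rand(50, 2)                       # toy data
y = (X[:, 0] > 0.5).astype(int)

clf = GaussianProcessClassifier().fit(X, y)
binary_gpc = clf.base_estimator_                # private binary Laplace estimator
binary_gpc.predict_var = types.MethodType(predict_var, binary_gpc)
print(binary_gpc.predict_var(X[:5]))            # variance of the latent (logit) function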
I have a regression task: y = f(x).
Most of the values in y are zero, so if I use mean squared error (MSE) as the loss function, the model predicts very small values for all of y.
So, I want to give larger weight to the non-zero values in y.
What should I do?
One solution I want to try is to define a new loss function:
loss = e * mse(y, y_pred)[y!=0] + (1-e) * mse(y, y_pred)[y==0]
e is the weight parameter. Will it work? How can I implement this in TensorFlow?
If you just want larger values in your y tensor, you can multiply it by a scalar:
y = tf.math.multiply(y, 10)
In some cases, models trained with MSE predict very smooth outputs; if you want sharper outputs you can use mean_pairwise_squared_error:
loss = tf.losses.mean_pairwise_squared_error(y, y_pred, weights=e)
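Alternatively, to implement the weighted loss sketched in the question more directly, here is a minimal sketch (assuming TensorFlow 2.x with Keras, and that e is a float between 0 and 1):
import tensorflow as tf

def weighted_mse(e):
    """Squared error weighted by e where y != 0 and by (1 - e) where y == 0."""
    def loss(y_true, y_pred):
        mask = tf.cast(tf.not_equal(y_true, 0.0), y_pred.dtype)
        weights = e * mask + (1.0 - e) * (1.0 - mask)
        return tf.reduce_mean(weights * tf.square(y_true - y_pred))
    return loss

# Usage with a hypothetical Keras model:
# model.compile(optimizer="adam", loss=weighted_mse(0.9))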
I generate a simple linear model in which X (dimension D) variables come from multi-normal with 0 covariance. Only the first 10 variables have true coefficients of 1, the rest have coefficients 0. Hence, theoretically, the ridge regression results should be the true coefficients divided by (1+C), where C is the penalty constant.
import numpy as np
from sklearn import linear_model
def generate_data(n):
    d = 100
    w = np.zeros(d)
    for i in range(0, 10):
        w[i] = 1.0
    trainx = np.random.normal(size=(n, d))
    e = np.random.normal(size=(n))
    trainy = np.dot(trainx, w) + e
    return trainx, trainy
Then I use:
n = 200
x,y = generate_data(n)
regr = linear_model.Ridge(alpha=4,normalize=True)
regr.fit(x, y)
print(regr.coef_[0:20])
Under normalize = True, I get the first 10 coefficients to be somewhere around 20% (i.e. 1/(1+4)) of the true value of 1. When normalize = False, I get the first 10 coefficients to be around 1, which are the same results as a simple linear regression model. Moreover, since I generate the data with mean = 0 and std = 1, normalize = True shouldn't do anything, as the data is already "normalized". Can someone explain to me what is going on here? Thanks!
It's important to understand that normalizing and standardizing are not the same and both cannot be done at the same time. You can either normalize or standardize.
Standardizing usually refers to transforming the data so that it has zero mean and unit variance, e.g. by removing the mean and dividing by the standard deviation, feature (column) wise.
Normalizing commonly refers to transforming the data values to a range between 0 and 1, e.g. by dividing by the length (norm) of the vector. That doesn't make the mean 0 or the variance 1.
After generating trainx and trainy, they are not normalized yet. Print them to see for yourself.
So, when normalize=True, trainx will be normalized by subtracting the mean and dividing by the l2-norm (according to sklearn).
When normalize=False, trainx will remain as is.
If you set normalize=True, every feature column is divided by its L2 norm; in other words, the magnitude of every feature column is diminished, which makes the estimated coefficients larger (Xβ should stay more or less constant: the smaller X, the larger β). When the coefficients are larger, a greater L2 penalty is imposed, so the objective places more weight on the L2 penalty than on the linear part (Xβ). As a result, the coefficient estimates are not as accurate as in plain linear regression.
By contrast, with normalize=False, X is larger and β is smaller; given the same alpha, the L2 penalty is marginal and most of the weight is on the linear part, so the result is close to plain linear regression.
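To see the effect yourself, here is a sketch that reproduces roughly what normalize=True did (centering each column and dividing by its L2 norm) and contrasts it with plain standardization; it reuses the generate_data function from the question:
import numpy as np
from sklearn.linear_model import Ridge

x, y = generate_data(200)
x_cen = x - x.mean(axis=0)

# What normalize=True did: divide each centered column by its L2 norm
# (~sqrt(n), not ~1), fit, then rescale the coefficients back.
scale = np.linalg.norm(x_cen, axis=0)
ridge = Ridge(alpha=4).fit(x_cen / scale, y)
print((ridge.coef_ / scale)[:10])               # ~0.2, as observed in the question

# Standardizing instead (mean 0, std 1) keeps the columns on roughly the raw
# scale, so the same alpha is a comparatively mild penalty.
x_std = x_cen / x_cen.std(axis=0)
print(Ridge(alpha=4).fit(x_std, y).coef_[:10])  # ~1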
I'm using hmmlearn's GaussianHMM to train a Hidden Markov Model with Gaussian observations. Each hidden state k has its corresponding Gaussian parameters: mu_k, Sigma_k.
After training the model, I would like to calculate the following quantity:
P(z_{T+1} = j | x_{1:T}),
where j = 1, 2, ... K, K is the number of hidden states.
The above probability is basically the one-step-ahead hidden state probabilities, given a full sequence of observations: x_1, x_2, ..., x_T, where x_i, i=1,...,T are used to train the HMM model.
I read the documentation, but couldn't find a function to calculate this probability. Is there any workaround?
The probability you are looking for is simply one row of the transition matrix. The n-th row of the transition matrix gives the probability of transitioning to each state at time t+1, given that the system is in state n at time t.
To find out which state the system is in at time t given the observations x_1,...,x_t, you can use the Viterbi algorithm, which is the default setting of the predict method in hmmlearn.
model = hmm.GaussianHMM(n_components=3, covariance_type="full", n_iter=100) # Viterbi is set by default as the 'algorithm' optional parameter.
model.fit(data)
state_sequence = model.predict(data)
prob_next_step = model.transmat_[state_sequence[-1], :]
I suggest you take a closer look at the documentation, which shows concrete usage examples.
Once an HMM model is trained, you can sample the t+1 state given the 1:t observations X as follows:
import numpy as np
from sklearn.utils import check_random_state
states = model.predict(X)
transmat_cdf = np.cumsum(model.transmat_, axis=1)
random_state = check_random_state(model.random_state)
next_state = (transmat_cdf[states[-1]] > random_state.rand()).argmax()
The t+1 state is sampled according to the state at time t and transmat_.
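If you want the full distribution P(z_{T+1} = j | x_{1:T}) rather than a sampled state or a Viterbi-based row, a sketch using predict_proba (the smoothed posterior over the states) would be:
import numpy as np

# Posterior P(z_t = i | x_{1:T}) for every time step, shape (T, K)
posteriors = model.predict_proba(X)

# One-step-ahead: sum_i P(z_T = i | x_{1:T}) * P(z_{T+1} = j | z_T = i)
prob_next = posteriors[-1] @ model.transmat_
print(prob_next)                                # length-K vector summing to 1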
I'm implementing a multinomial logistic regression model in Python using Scikit-learn. Here's my code:
X = pd.concat([each for each in feature_cols], axis=1)
y = train[["<5", "5-6", "6-7", "7-8", "8-9", "9-10"]]
lm = LogisticRegression(multi_class='multinomial', solver='lbfgs')
lm.fit(X, y)
However, I'm getting ValueError: bad input shape (50184, 6) when it tries to execute the last line of code.
X is a DataFrame with 50184 rows, 7 columns. y also has 50184 rows, but 6 columns.
I ultimately want to predict in what bin (<5, 5-6, etc.) the outcome falls. All the independent and dependent variables used in this case are dummy columns which have a binary value of either 0 or 1. What am I missing?
The Logistic Regression 3-class Classifier example illustrates how fitting LogisticRegression uses a vector rather than a matrix input, in this case the target variable of the iris dataset, coded as values [0, 1, 2].
To convert the dummy matrix to a series, you could multiply each column by a different integer and then - assuming it's a pandas.DataFrame - call .sum(axis=1) on the result. Something like:
for i, col in enumerate(y.columns.tolist(), 1):
    y.loc[:, col] *= i
y = y.sum(axis=1)
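Alternatively (assuming each row of y is a one-hot encoding with exactly one 1), you could avoid the loop and take the position or name of the maximum column directly:
y_codes = y.values.argmax(axis=1)   # integer codes 0..5
y_names = y.idxmax(axis=1)          # or keep the bin names, e.g. "<5", "5-6", ...
lm.fit(X, y_codes)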