SARIMAX simulation of possible paths - python

I am trying to simulate possible paths of a stochastic process that are not anchored to any particular point, e.g. fit a SARIMAX model to weather temperature data and then use the model to simulate temperature paths.
Here I use the standard demonstration from the statsmodels documentation as a simpler example:
import numpy as np
import pandas as pd
from scipy.stats import norm
import statsmodels.api as sm
import matplotlib.pyplot as plt
from datetime import datetime
import requests
from io import BytesIO
Fitting the model:
wpi1 = requests.get('https://www.stata-press.com/data/r12/wpi1.dta').content
data = pd.read_stata(BytesIO(wpi1))
data.index = data.t
# Set the frequency
data.index.freq="QS-OCT"
# Fit the model
mod = sm.tsa.statespace.SARIMAX(data['wpi'], trend='c', order=(1,1,1))
res = mod.fit(disp=False)
print(res.summary())
Creating simulation:
res.simulate(len(data), repetitions=10).plot();
Here is the history: [plot of the original series omitted]
Here is the simulation: [plot of the simulated paths omitted]
The simulated curves are so widely distributed and so far apart from each other that this cannot make sense. The original historical process doesn't have anywhere near that much variance. What am I understanding wrongly? How do I perform the simulation correctly?

When you don't pass an initial state, it uses the first predicted state to start the simulation along with its predicted covariance. Since there is no information available to make the first prediction, it uses a diffuse prior with a variance of 1,000,000. This is why you are getting the wide range in your time series. A simple solution is to pass your own initial state using the smoothed_state.
Taking your code above, but using
initial = res.smoothed_state[:, 0]
res.simulate(len(data), repetitions=10, initial_state=initial).plot()
I get a plot that looks like this: [plot omitted]
The first value is what really matters in this model, and is 30.6. You could add some randomness here directly by drawing the initial state from another (sensible) distribution. The default distribution is not sensible for simulation since it has a diffuse prior (it is, however, very sensible for estimation).
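To illustrate that last point, here is a minimal sketch (my illustration, not part of the original answer) that draws the initial state from a normal distribution centered at the smoothed initial state, assuming the smoothed state covariance is a sensible spread for the draw:
# Illustration only: randomize the initial state instead of fixing it.
# Assumes `res` from the SARIMAX fit above; using the smoothed state
# covariance at time 0 as the spread is an assumption, not a prescription.
rng = np.random.default_rng(42)
mean = res.smoothed_state[:, 0]
cov = res.smoothed_state_cov[:, :, 0]
random_initial = rng.multivariate_normal(mean, cov)
res.simulate(len(data), repetitions=10, initial_state=random_initial).plot()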
Other Notes
One other small note: You should not use trend="c" with d=1. You should instead use trend="t" when d=1 so that the model includes a drift. The model you estimate should be
mod = sm.tsa.statespace.SARIMAX(data["wpi"], trend="t", order=(1, 1, 1))
I used this model in the picture above to capture the positive trend in the data.

Related

Loading saved params (a Pandas series) into a Statsmodels state-space model

I'm building a dynamic factor model using the excellent python package statsmodels, and I would like to pickle an estimated parameter vector so I can build the model again later and load those params into it. (Cf. this notebook by Chad Fulton: https://github.com/ChadFulton/tsa-notebooks/blob/master/dfm_coincident.ipynb.)
In the following block of code, initial parameters are estimated with mod.fit() (using the Powell algorithm) and then passed back to mod.fit() as initial_res.params, which serves as the starting point for the EM algorithm to complete the estimation. (The latter is a Pandas Series.)
mod = sm.tsa.DynamicFactor(endog, k_factors=1, factor_order=2, error_order=2)
initial_res = mod.fit(method='powell', disp=False)
res = mod.fit(initial_res.params)
I would like to pickle res.params (again, a small Pandas Series, and a small disk footprint). Then later build the model from scratch again, and load my saved parameters into it without having to re-estimate the model. Anyone know how that can be done?
Examples I have seen suggest pickling the results object res, but that can be a pretty big save. Building it from scratch is pretty simple, but estimation takes a while. It may be that estimation starting from the saved optimal params is quicker; but still, that's pretty amateurish, right?
TIA,
Drew
You can use the smooth method on any state space model to construct a results object from specific parameters. In your example:
mod = sm.tsa.DynamicFactor(endog, k_factors=1, factor_order=2, error_order=2)
initial_res = mod.fit(method='powell', disp=False)
res = mod.fit(initial_res.params)
res.params.to_csv(...)  # persist the small parameter vector
# ...later...
params = pd.read_csv(...)  # load the saved parameters back
res = mod.smooth(params)   # results object from given params, no re-estimation
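Spelled out as a full round trip (a sketch with a hypothetical file name, 'dfm_params.csv'; note that pd.read_csv returns a DataFrame, so you need to pull out the value column to recover a Series):
res.params.to_csv('dfm_params.csv')  # hypothetical file name
params = pd.read_csv('dfm_params.csv', index_col=0).iloc[:, 0]
res = mod.smooth(params)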

How to generate the actual results of an IRF() function in python?

I am unable to generate the actual underlying values of the IRFs. See the code for a simple VAR model below.
import numpy as np
from statsmodels.tsa.api import VAR

model = VAR(df_differenced.astype(float))
results = model.fit()
irf = results.irf(10)
I can generate the resulting IRF plots just fine with this code:
irf.plot(orth=False)
But, I can't generate the underlying values. I'd like to do so to have precise figures. Visually interpreting IRFs is not that accurate. Using the summary() did not provide me this information.
I would really appreciate some help. Thanks in advance.
You need to use the irfs attribute, or cum_effects for the cumulative IRFs. results.irf(10) returns an IRAnalysis object, not the values themselves. (The documentation here is below the standard it should be.)
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

df = pd.DataFrame(np.random.standard_normal((300, 3)))
model = VAR(df)
results = model.fit()
irf = results.irf(10)
print(irf.irfs)         # impulse response point estimates
print(irf.cum_effects)  # cumulative impulse responses
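The irfs array has shape (periods + 1, neqs, neqs): entry [h, i, j] is the response of variable i at horizon h to a unit shock in variable j. A small sketch of pulling out one labeled path (my illustration, continuing the example above):
# Response of variable 0 to a shock in variable 1, horizons 0..10
response_0_to_1 = pd.Series(irf.irfs[:, 0, 1], name='response of 0 to shock in 1')
print(response_0_to_1)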
You are close to the actual answer. results.irf(10) returns the IRAnalysis object described above; its irfs attribute holds the actual point estimates from the VAR. (If you fit the state-space sm.tsa.VARMAX model instead, its results expose an impulse_responses() method that returns the values directly.)

Running same python code multiple times and getting inconsistent results

I am new to Python, so I am not sure if this problem is due to my inexperience or whether this is a glitch.
I am running this code multiple times on the same data (no random number generation) and getting different results. This has occurred with more than one variable so far, and obviously I cannot proceed with the analysis until I figure out which results are trustworthy. Here is a short sample of the results I have obtained after running the code four times. Why is there such a discrepancy between these outputs? I am puzzled and greatly appreciate your advice.
Linear Regression
from scipy.stats import linregress
import scipy.stats
from scipy.signal import welch
import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.signal as signal
part_022_o = pd.read_excel(r'C:\Users\Me\Desktop\Behavioral Data Processed\part_022_combined_other.xlsx')
distance_o = part_022_o["distance"]
fs = 200
f, Pwelch_spec = signal.welch(distance_o, fs=fs, window='hanning',nperseg=400, noverlap=200, scaling='density', average='mean')
log_f = np.log(f, where=f>0)
log_pwelch = np.log(Pwelch_spec, where=Pwelch_spec>0)
idx = np.isfinite(log_f) & np.isfinite(log_pwelch)
polynomial_coefficients = np.polyfit(log_f[idx],log_pwelch[idx],1)
print(polynomial_coefficients)
scipy.stats.linregress(log_f[idx], log_pwelch[idx])
Results First Attempt
[ 0.00324568 -2.82962602]
Results Second Attempt
[-2.70137164 6.97117509]
Results Third Attempt
[-2.70137164 6.97117509]
Results Fourth Attempt
[-2.28028005 5.53839502]
The same thing happens when I use scipy.stats.linregress().
Thank you,
Confused
Edit: full code added.
Also, the issue appears to be related to np.log(), since only the values of the log_f array seem to change between runs. It is hard to be certain that nothing else is changing (e.g. log_pwelch), but the differences in output clearly correspond to differences in the first value of the log_f array.
Edit: I have narrowed the issue down to np.log(f, where=f>0). The first value in the f array is zero. According to the numpy.log documentation: "...Note that if an uninitialized out array is created via the default out=None, locations within it where the condition is False will remain uninitialized." Apparently this means the value is unpredictable and can vary from run to run, which is exactly what I am observing. Given my inexperience with Python, I am not sure what the best solution is (e.g. specifying the out array in the log call, using a random seed, or just noting the regression coefficients whenever the zero value happens to be unchanged after the log).
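For what it's worth, a minimal sketch of the first of those options, pre-filling the out array so the masked entries become a deterministic nan instead of leftover memory (my illustration, not from the thread):
# Pre-filled out arrays make the where-masked entries a predictable nan;
# the existing np.isfinite(...) mask then filters them exactly as before.
log_f = np.log(f, out=np.full_like(f, np.nan), where=f > 0)
log_pwelch = np.log(Pwelch_spec, out=np.full_like(Pwelch_spec, np.nan), where=Pwelch_spec > 0)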
Try to use a random seed to reproduce results. Do this with the following code at the top of your program:
import numpy as np
np.random.seed(123)  # or any number you want
see here for more info: https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.seed.html
A random seed ensures you get repeatable results when some part of your program is generating numbers at random.
Also check the documentation to find out what the functions you call (np.polyfit(), np.log()) are actually doing.
Using a seed value is standard practice in scikit-learn and machine learning generally.

Find the sum of the residuals

I am doing a hands-on exercise on Poisson regression from the Stats with Python course on Fresco Play.
The problem statement is:
Load the R dataset Insurance from the MASS package.
Capture the data as a pandas dataframe.
Build a Poisson regression model with a log of an independent variable
Holders, and dependent variable Claims.
Fit the model with data, and find the sum of the residuals.
I am stuck on the last step, i.e. the sum of the residuals.
I used np.sum(model.resid), but the answer is not accepted.
Here is my code
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
INS_data = sm.datasets.get_rdataset('Insurance','MASS').data
model = smf.poisson('Claims ~ np.log(Holders)', INS_data).fit()
print(np.sum(model.resid))
I was running the code in Python 2, which gave the wrong answer, but running it in Python 3 gave the correct answer. I don't know the reason, but the code works perfectly in Python 3.
For the residuals, you can use the basic definition: actual minus predicted.
Here is the code snippet.
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np

Insurance = sm.datasets.get_rdataset('Insurance', 'MASS')
data = Insurance.data
data['Holders_'] = np.log(data['Holders'])
model = smf.poisson('Claims ~ Holders_', data).fit()
y_predicted = model.predict(data)  # fitted values
residual = data['Claims'] - y_predicted
print(sum(residual))
After much searching, I came to know that it expects the cumulative sum, so use
np.cumsum(model.resid)
It will pass in Fresco Play.
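Putting the pieces together, a minimal version of the passing solution (assuming the grader checks the cumulative sum of the response residuals):
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

data = sm.datasets.get_rdataset('Insurance', 'MASS').data
model = smf.poisson('Claims ~ np.log(Holders)', data).fit()
print(np.cumsum(model.resid))  # cumulative sum of the response residuals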

How to replace sts.LinearRegression with non-linear model for a Tensorflow Probability Structured Time Series model component

I'm creating a time-series forecasting model with external, controllable features, similar to the "Forecasting Demand for Electricity" example found at https://medium.com/tensorflow/structural-time-series-modeling-in-tensorflow-probability-344edac24083. In order to model the influence of the external factors, I am using an sts.LinearRegression() as a component of my model, but those external factors are very non-linear in nature, and this is causing unwanted negative predictions in my model.
I've tried creating (simpler) forecasts outside of TFP STS and found that a RandomForestRegressor works much better than a LinearRegressor for these external features. What I'd LIKE to do is replace the sts.LinearRegression() with an sts.RandomForestRegressor(), but that isn't available in the sts library. In fact, there are hardly any options available in the sts library: https://www.tensorflow.org/probability/api_docs/python/tfp/sts/LinearRegression
I've also tried converting my target variable to log form, but there are numerous zeros (whose log is -inf), so this doesn't turn out to be a useful transformation.
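As an aside (my suggestion, not from the post): the log1p transform stays finite at exact zeros, which sidesteps the -inf problem, though whether it helps is data-dependent.
y_log = np.log1p(y)       # log(1 + y); `y` is a hypothetical name for the target
y_back = np.expm1(y_log)  # exact inverse transform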
My model architecture for TFP STS looks something like this:
def build_model(observed_time_series):
    season_effect = sts.Seasonal(
        num_seasons=4, num_steps_per_season=13,
        observed_time_series=observed_time_series,
        name='season_effect')
    marketing_effect = sts.LinearRegression(
        design_matrix=tf.stack([recent_publicity - np.mean(recent_publicity),
                                active_ad - np.mean(active_ad)], axis=-1),
        name='marketing_effect')
    autoregressive = sts.Autoregressive(
        order=1,
        observed_time_series=observed_time_series,
        name='autoregressive')
    model = sts.Sum([season_effect, marketing_effect, autoregressive],
                    observed_time_series=observed_time_series)
    return model
Where I want to change the "marketing_effect" component of the model to something non-linear.
Is my only option here to clone the TFP STS library and create a custom function to handle non-linear data with something like a Random Forest Regressor? Does anyone know of a better option?
I'm not familiar with the use of random forests in STS models. Can you point to a system where this exists? The trick with tfp.sts is that the math all works out nicely and analytically because everything is marginally Gaussian. If we can make that work, I think we're definitely open to bringing in other models.
