I am unable to generate the actual underlying values of the IRFs. See code of a simple VAR model.
import numpy as np
import statsmodels.tsa as sm
model = VAR(df_differenced.astype(float))
results = model.fit()
irf = results.irf(10)
I can generate the resulting IRF plots just fine with this code:
irf.plot(orth=False)
But, I can't generate the underlying values. I'd like to do so to have precise figures. Visually interpreting IRFs is not that accurate. Using the summary() did not provide me this information.
I would really appreciate some help. Thanks in advance.
You need to use the irfs property or cum_effects (cumulative irf). results.irf returns an IRAnalysis object. The documentation is below the standard where it should be.
import numpy as np
import statsmodels.tsa as sm
import pandas as pd
df = pd.DataFrame(np.random.standard_normal((300,3)))
model = VAR(df)
results = model.fit()
irf = results.irf(10)
print(irf.irfs)
print(irf.cum_effects)
You are close to the actual answer.
You can type
results.irf(10)
or try
results.impulse_responses(10)
It will give you a table with the actual point estimates from the VAR
Related
I am trying to create a simulation of possible paths of a stochastic process, which is not anchored to any particular point. E.g. fit SARIMAX model to weather temperature data and then use the model to make a simulation of the temperature.
Here I use the standard demonstration from statsmodels page as a simpler example:
import numpy as np
import pandas as pd
from scipy.stats import norm
import statsmodels.api as sm
import matplotlib.pyplot as plt
from datetime import datetime
import requests
from io import BytesIO
Fitting the model:
wpi1 = requests.get('https://www.stata-press.com/data/r12/wpi1.dta').content
data = pd.read_stata(BytesIO(wpi1))
data.index = data.t
# Set the frequency
data.index.freq="QS-OCT"
# Fit the model
mod = sm.tsa.statespace.SARIMAX(data['wpi'], trend='c', order=(1,1,1))
res = mod.fit(disp=False)
print(res.summary())
Creating simulation:
res.simulate(len(data), repetitions=10).plot();
Here is the history:
Here is the simulation:
The simulated curves are so widely distibuted and apart from each other that this cannot make sense. The initial historical process doesn't have that much of a variance. What do I understand wrongly? How to perform the right simulation?
When you don't pass an initial state, it uses the first predicted state to start the simulation along with its predicted covariance. Since there is no information available to make the first prediction, it uses a diffuse prior with a variance of 1,000,000. This is why you are getting the wide range in your time series. A simple solution is to pass your own initial state using the smoothed_state.
Taking your code above, but using
initial = res.smoothed_state[:, 0]
res.simulate(len(data),
repetitions=10,
initial_state=initial).plot()
I get a plot that looks like
The first value is what really matters in this model, and is 30.6. You could add some randomness here directly by drawing the initial state from another (sensible) distribution. The default distribution is not sensible for simulation since it has a diffuse prior (it is, however, very sensible for estimation).
Other Notes
One other small note: You should not use trend="c" with d=1. You should instead use trend="t" when d=1 so that the model includes a drift. The model you estimate should be
mod = sm.tsa.statespace.SARIMAX(data["wpi"], trend="t", order=(1, 1, 1))
I used this model in the picture above to capture the positive trend in the data.
I am new to Python, so I am not sure if this problem is due to my inexperience or whether this is a glitch.
I am running this code multiple times on the same data (no random number generation) and getting different results. This has occurred with more than one variable so far, and obviously I cannot proceed with the analysis until I figure out which results are trustworthy. Here is a short sample of the results I have obtained after running the code four times. Why is there such a discrepancy between these outputs? I am puzzled and greatly appreciate your advice.
Linear Regression
from scipy.stats import linregress
import scipy.stats
from scipy.signal import welch
import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.signal as signal
part_022_o = pd.read_excel(r'C:\Users\Me\Desktop\Behavioral Data Processed\part_022_combined_other.xlsx')
distance_o = part_022_o["distance"]
fs = 200
f, Pwelch_spec = signal.welch(distance_o, fs=fs, window='hanning',nperseg=400, noverlap=200, scaling='density', average='mean')
log_f = np.log(f, where=f>0)
log_pwelch = np.log(Pwelch_spec, where=Pwelch_spec>0)
idx = np.isfinite(log_f) & np.isfinite(log_pwelch)
polynomial_coefficients = np.polyfit(log_f[idx],log_pwelch[idx],1)
print(polynomial_coefficients)
scipy.stats.linregress(log_f[idx], log_pwelch[idx])
Results First Attempt
[ 0.00324568 -2.82962602]
Results Second Attempt
[-2.70137164 6.97117509]
Results Third Attempt
[-2.70137164 6.97117509]
Results Fourth Attempt
[-2.28028005 5.53839502]
The same thing happens when I use scipy.stats.linregress().
Thank you,
Confused
Edit: full code added.
Also, the issue appears to be related to np.log(), since only the values of "log_f" array seem to be changing with the different outputs. It is hard to be certain that nothing else is changing (e.g. log_pwelch), but differences in output clearly correspond to differences in the first value of the "log_f" array.
Edit: I have narrowed the issue down to np.log(f, where=f>0). The first value in the f array is zero. According to the documentation of numpy log, "...Note that if an uninitialized out array is created via the default out=None, locations within it where the condition is False will remain uninitialized." Apparently this means that the value or variable is unpredictable and can vary from trial to trial, which is exactly what I am observing. Given my inexperience with Python, I am not sure what the best solution is (e.g. specifying the out-array in the log function, use a random seed, just note the regression coefficients whenever the value of zero is unchanged after log, etc.)
Try to use a random seed to reproduce results. Do this with the following code at the top of your program:
import numpy as np
np.random.seed(123) or any number you want
see here for more info: https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.seed.html
A random seed ensures you get repeatable results when some part of your program is generating numbers at random.
Try finding out what the functions (np.polyfit(), np.log()) are actually doing using documentation.
This is standard practice for scikit-learn and ML to use a seed value.
I was trying to understand what exactly scipy.signal.resample is to do. Here is a simple code:
import numpy as np
from scipy.signal import resample
import matplotlib.pyplot as plt
t=np.linspace(1,4,10)
x=t
x_resampled=resample(x,10000)
plt.plot(x)
plt.figure()
plt.plot(x_resampled)
the input is
and the output of code is
however, I expected
Can you please tell me how can I settle scipy.signal.resample to obtain such a result?
Look at the documentation for resample (emphasis mine):
Because a Fourier method is used, the signal is assumed to be periodic.
On the non-periodic data you're using the above assumption doesn't hold. So a valid result should not be expected.
I am doing a hands on exercise of Poissons Regression of Stats with Python in Fresco Play.
Problem statement is like:
Load the R dataset Insurance from the MASS package.
Capture the data as a pandas dataframe.
Build a Poisson regression model with a log of an independent variable
Holders, and dependent variable Claims.
Fit the model with data, and find the sum of the residuals.
I am stuck with the last line i.e. Sum of Residuals
I used np.sum(model.resid). But answer is not accepted
Here is my code
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
INS_data = sm.datasets.get_rdataset('Insurance','MASS').data
model = smf.poisson('Claims ~ np.log(Holders)', INS_data).fit()
print(np.sum(model.resid))
I was running the code in Python2 which gave wrong answer but running it in Python3 gave the correct answer. I don't know the reason but code works perfectly in Python3
For residual, you can use the basic concept of residual i.e. actual - predicted.
Here is the code snippet.
import statsmodels.api as sm
import numpy as np
import statsmodels.formula.api as smf
Insurance = sm.datasets.get_rdataset('Insurance','MASS')
data = Insurance.data
data['Holders_'] = np.log(data['Holders'])
model = smf.poisson('Claims ~ Holders_',data).fit()
y_predicted = p.predict(data['Holders_'])
residual = (data['Claims']-y_predicted)
print(sum(residual))
output
After much serach, i came to know that it is expecting cumulative sum so use
np.cumsum(model.resid)
It will pass in Frescoplay
I'm fairly new to programming, but this problem happens in python and in excel as well.
I'm using the following formulas for the RC transfer function
s/(s+1) for High Pass
1/(s+1) for Low Pass
with s = jwRC
below is the code I used in python
from pylab import *
from numpy import *
from cmath import *
"""
Generating a transfer function for RC filters.
Importing modules for complex math and plotting.
"""
f = arange(1, 5000, 1)
w = 2.0j*pi*f
R=100
C=1E-5
hp_tf = (w*R*C)/(w*R*C+1) # High Pass Transfer function
lp_tf = 1/(w*R*C+1) # Low Pass Transfer function
plot(f, hp_tf) # plot high pass transfer function
plot(f, lp_tf, '-r') # plot low pass transfer function
xscale('log')
I can't post images yet so I can't show the plot. But the issue here is the cutoff frequency is different for each one. They should cross at y=0.707, but they actually cross at about 0.5.
I figure my formula or method is wrong somewhere, but I can't find the mistake can anyone help me out?
Also, on a related note, I tried to convert to dB scale and I get the following error:
TypeError: only length-1 arrays can be converted to Python scalars
I'm using the following
debl=20*log(hp_tf)
This is a classical example why you should avoid pylab and more generally imports of the form
from module import *
unless you know exactly what it does, since it hopelessly clutters the name space.
Using,
import matplotlib.pyplot as plt
import numpy as np
and then calling np.log and plt.plot etc. will solve your problem.
Furether explanations
What's happening here is that,
from pylab import *
defines a log function from numpy that operate on arrays (the one you want).
However, the later import,
from cmath import *
overwrites it with a version that only accepts scalars, hence your error.