How to set order AND seasonal params for SARIMA - python

Hi, can someone give me a beginner's guide to setting the order and seasonal_order parameters in the SARIMA model from statsmodels?
Are those numbers obtained from the ACF and PACF plots? If yes, how do you get the numbers for the AR and MA terms from those plots? I know the differencing order (I) cannot be derived from those two plots, so how should I decide whether to set it to 0, 1, or 2?
Thank you

You can use pmdarima (the renamed successor of pyramid-arima):
pip install pmdarima
or the older package:
pip install pyramid-arima
Please look at the pyramid-auto-arima documentation.
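As a minimal sketch of what auto_arima does for you (assuming series is your univariate data and a seasonal period of m=12, e.g. monthly observations), it searches over (p, d, q)(P, D, Q, m) and keeps the combination with the best information criterion:

import pmdarima as pm

# Stepwise search over (p, d, q) and (P, D, Q, m) by AIC minimization
model = pm.auto_arima(series, seasonal=True, m=12,
                      stepwise=True, trace=True,
                      suppress_warnings=True, error_action='ignore')
print(model.order, model.seasonal_order)  # the selected orders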
from pyramid.arima.stationarity import ADFTest
import matplotlib.pyplot as plt

# Augmented Dickey-Fuller test for stationarity
adf_test = ADFTest(alpha=0.05)
adf_test.is_stationary(series)

# Train/test split (contiguous slices, no gap between them)
train, test = series[:500], series[500:910]
train.shape
test.shape

plt.plot(train)
plt.plot(test)
plt.title("Pyramid")
plt.show()
Ultimately this fits a statsmodels SARIMAX under the hood, e.g.:
from statsmodels.tsa.statespace.sarimax import SARIMAX

arima_fit = SARIMAX(data_set, order=(1, 0, 1), seasonal_order=(0, 1, 0, 50), trend='c').fit()
prediction = arima_fit.predict(start=start, end=end, dynamic=True)  # start/end are placeholder bounds
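To answer the differencing part of the question directly: pmdarima also ships a unit-root-test helper that estimates d for you. A minimal sketch, assuming series is your univariate data:

from pmdarima.arima.utils import ndiffs

# Estimate the differencing order d with two different unit-root tests
print(ndiffs(series, test='adf'))   # Augmented Dickey-Fuller
print(ndiffs(series, test='kpss'))  # KPSS
# The estimate is typically 0, 1, or 2.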
Regarding your question about the ACF and PACF:
ACF stands for autocorrelation function, and PACF for partial autocorrelation function. Looking at these two plots together can help us form an idea of which models to fit. The ACF computes and plots the autocorrelations of a time series; the autocorrelation at lag k is the correlation between observations separated by k time units.
Similarly, partial autocorrelations measure the strength of the relationship with the other terms accounted for, the other terms here being the intervening lags present in the model. For example, the partial autocorrelation at lag 4 is the correlation at lag 4 after accounting for the correlations at lags 1, 2, and 3. (The quoted source generates these plots in Minitab via Stat > Time Series > Autocorrelation or Stat > Time Series > Partial Autocorrelation, and shows them for simulated data; the plots are not reproduced here.)
Source: Fitting an ARIMA model
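Since the question is about Python rather than Minitab, the same plots can be generated with statsmodels. A minimal sketch, assuming series holds the observed data:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(series, lags=40, ax=axes[0])   # gradual tail-off suggests an AR component
plot_pacf(series, lags=40, ax=axes[1])  # sharp cutoff at lag p suggests AR(p)
plt.tight_layout()
plt.show()

As a rule of thumb, a PACF that cuts off after lag p points to AR(p) terms, and an ACF that cuts off after lag q points to MA(q) terms.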

Related

Prophet: Multiplicative seasonality scales down trend values

I am seeing strange effects when I change my Prophet model from additive to multiplicative seasonality:
while my predictions get a lot better, my trend seems to be scaled down to about 10% of the expected values.
I would expect the trend to stay in the same value range. Am I wrong?
Example with additive seasonality:
proph_model = Prophet()
proph_model.fit(df)
Result as expected (plot not shown).
Example with multiplicative seasonality:
proph_model = Prophet(seasonality_mode="multiplicative")
proph_model.fit(df)
Result with better prediction but a strangely scaled-down trend (plot not shown).
I am currently working with the latest Prophet 1.1.1 on Python 3.10.6.
- For additive seasonality, the assumption is that you add components (trend + seasonality): the same amplitude and/or frequency over time. Use it when the magnitude of the seasonality does not change in relation to time.
- For multiplicative seasonality, the assumption is that you multiply components (trend * seasonality): increasing or decreasing amplitude and/or frequency over time. Use it when the magnitude of the seasonality does change in relation to time.
Given this, you can expect different behavior of your model for different seasonality_mode settings. Only if the trend doesn't change much do additive and multiplicative seasonality give comparable results.
You can see more details of your model's components using proph_model.plot_components(forecast); maybe that will make the whole picture clearer for you.
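For completeness, a minimal sketch of how the forecast object used above is typically produced (the 365-day horizon is just an example value):

future = proph_model.make_future_dataframe(periods=365)
forecast = proph_model.predict(future)
fig = proph_model.plot_components(forecast)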
Another way to get an overview is to decompose the time series:
from statsmodels.tsa.seasonal import seasonal_decompose

df = df.set_index('ds')  # seasonal_decompose expects a DatetimeIndex
result_add = seasonal_decompose(df, model='additive')
result_add.plot()
result_m = seasonal_decompose(df, model='multiplicative')
result_m.plot()
In any case, I think the best way to choose the optimal parameters is to run cross-validation and compare the prediction error with multiplicative vs. additive seasonality_mode.
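A minimal sketch of that comparison using Prophet's built-in diagnostics (the initial/period/horizon windows below are assumptions to adapt to your data; run it once per seasonality_mode and compare the error metrics):

from prophet.diagnostics import cross_validation, performance_metrics

df_cv = cross_validation(proph_model, initial='730 days', period='180 days', horizon='365 days')
df_p = performance_metrics(df_cv)
print(df_p[['horizon', 'rmse', 'mape']].head())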

Why does statsmodels acf function give different answers to scipy's pearsonr?

I calculated the ACF at particular lags of a time series "manually" by shifting in Pandas and using scipy.stats.pearsonr(), but got an answer visibly different from what was shown in the ACF plot from statsmodels.graphics.tsaplots.plot_acf().
Looking into it more, I calculated the ACF with statsmodels.api.tsa.acf() and while the value there agrees with the ACF plot (as I'd expect, since that's what plot_acf() is plotting!) it's substantially different from the Pearson correlation. MWE (data file and Jupyter notebook, plus .py file with same code in case you prefer!) available at: https://github.com/MultiverseHG/acf_problem. Code also shown below:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from scipy.stats import pearsonr
# read in data
bus = pd.read_csv('cleaned_bus.csv', index_col='Month', parse_dates=True)
# compute first 12 lags of ACF with statsmodels
sm_acf = sm.tsa.acf(bus, nlags=12, fft=False)
print(sm_acf)
# compute ACF by shifting and using scipy.stats.pearsonr
# drop nulls that arise from shifting and corresponding rows of unshifted data
trimmed_acf = []
for lag in range(13):
    shifted = bus.riders.shift(lag).iloc[lag:]  # drop the NaNs created by shifting
    trimmed = bus.riders.iloc[lag:]             # corresponding rows of the unshifted data
    corr = pearsonr(shifted, trimmed)[0]        # [0] to grab r ([1] is the p-value)
    trimmed_acf.append(corr)
trimmed_acf = np.array(trimmed_acf)
print(trimmed_acf)
# not the same - how different are they?
print(sm_acf - trimmed_acf)
# maybe acf is filling in the missing values with zeroes?
# from looking at the source code, doesn't seem like it, but let's try
zeroed_acf = []
for lag in range(13):
    shifted = bus.riders.shift(lag).fillna(0)
    corr = pearsonr(shifted, bus.riders)[0]
    zeroed_acf.append(corr)
zeroed_acf = np.array(zeroed_acf)
print(zeroed_acf)
# different again!
print(sm_acf - zeroed_acf)
# so why does statsmodels give a different ACF than calculating directly with Pearson's r?

How to detect anomalies in time series data with trend and seasonality present in it?

I want to detect the outliers in time series data which contains trend and seasonality components. I want to leave out the peaks which are seasonal and consider only the other peaks, labeling them as outliers. As I am new to time series analysis, please assist me in approaching this problem.
The coding platform I am using is Python.
Attempt 1: Using an ARIMA model
I trained my model and forecasted the test data, then computed the difference between the forecast and the actual test values, and identified outliers based on the observed variance.
Implementation of auto_arima:
!pip install pyramid-arima
from pyramid.arima import auto_arima
import math
import numpy as np
import statsmodels.api as sm
import statsmodels.tsa.api as smt
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

# Split data into train and test sets
train, test = actual_vals[0:-70], actual_vals[-70:]

# Log transformation
train_log, test_log = np.log10(train), np.log10(test)

# Convert to lists
history = [x for x in train_log]
predictions = list()
predict_log = list()

stepwise_model = auto_arima(train_log, start_p=1, start_q=1, max_p=3, max_q=3, m=7,
                            start_P=0, seasonal=True, d=1, D=1, trace=True,
                            error_action='ignore', suppress_warnings=True, stepwise=True)

# Fit the stepwise ARIMA model with walk-forward validation
for t in range(len(test_log)):
    stepwise_model.fit(history)
    output = stepwise_model.predict(n_periods=1)
    predict_log.append(output[0])
    yhat = 10 ** output[0]  # invert the log10 transform
    predictions.append(yhat)
    obs = test_log[t]
    history.append(obs)

# Plotting
plt.figure(figsize=(12, 7))
plt.plot(test, label='Actuals')
plt.plot(predictions, color='red', label='Predicted')
plt.legend(loc='upper right')
plt.show()
But this way I can detect the outliers only in the test data. I actually need to detect the outliers for the whole time series, including the training data.
Attempt 2: Using seasonal decomposition
I used the code below to split the original data into seasonal, trend, and residual components (plot not shown).
from statsmodels.tsa.seasonal import seasonal_decompose
# Arguments assumed here: the original call was left empty; period=7 mirrors m=7 above
decomposed = seasonal_decompose(actual_vals, model='additive', period=7)
Then I use the residual data to find the outliers with a boxplot, since the seasonal and trend components have been removed. Does this make sense?
Or is there any other simple or better approach to take?
You can:
- In the 4th graph (the residual plot) from "Attempt 2: Using seasonal decomposition", check for extreme points; that may lead you to some anomalies in the seasonal series (see the sketch after this list).
- Supervised (if you have some labeled data): do some classification.
- Unsupervised: try to predict the next value and create a confidence interval, then check whether the prediction lies inside it or not.
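A minimal sketch of the residual-based idea from the first bullet, assuming series is a pandas Series with a DatetimeIndex and weekly seasonality (period=7 mirrors the m=7 used above); points whose residual falls outside 1.5 * IQR are flagged:

from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(series, model='additive', period=7)
resid = result.resid.dropna()
q1, q3 = resid.quantile(0.25), resid.quantile(0.75)
iqr = q3 - q1
anomalies = resid[(resid < q1 - 1.5 * iqr) | (resid > q3 + 1.5 * iqr)]
print(anomalies)  # timestamps and residual sizes of the flagged points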
You can try to calculate the relative extrema of the data, using argrelextrema as shown here for example:
import numpy as np
from scipy.signal import argrelextrema

x = np.array([2, 1, 2, 3, 2, 0, 1, 0])
argrelextrema(x, np.greater)
output:
(array([3, 6]),)
Some random data (my implementation of the above argrelextrema; plot not shown):
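For noisier series, a hedged variant of the same idea: the order argument widens the comparison window so small wiggles are not reported as extrema (the synthetic sine-plus-noise data below is only an illustration):

import numpy as np
from scipy.signal import argrelextrema

rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 6 * np.pi, 300)) + rng.normal(0, 0.1, 300)
peaks = argrelextrema(y, np.greater, order=10)[0]  # indices of local maxima
print(peaks)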

Is there a way to get the probability of a prediction using XGBoostRegressor?

I have built an XGBoostRegressor model using around 200 categorical features to predict a continuous time variable.
But I would want to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I want both $\hat{y}$ and $P(Y|X)$ as output. Any idea how to do this?
There is no probability in regression. In regression the only output you get is a predicted value; that's why it is called regression. So for any regressor, the probability of a prediction is not available out of the box; it only exists in classification.
As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note, though, is that the variance might not be the same across the data.
Let's assume you are studying a time-based phenomenon. Specifically, you have the temperature (y) after a time (x) (in seconds, for instance) inside an oven. At x = 0 s it is at 20°C; you start heating it and want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or be completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure you have taken care of heteroscedasticity, so your interval is the same for all the data.
You can probably try to get the distribution of your known outputs, compare the prediction against that curve, and check the p-value. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
EDIT
This is how I would do it. Obviously the outputs are your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d
N = 1000  # the number of samples
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)

# We want a normed histogram (since this is a PDF, it must integrate to 1)
nbins = N / 10
n = int(N / nbins)
p, x = np.histogram(outputs, bins=n, density=True)  # normed= was removed from NumPy
plt.hist(outputs, bins=n, density=True)
x = x[:-1] + (x[1] - x[0]) / 2  # converting bin edges to centers

# Now we want to interpolate:
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9 * std, 2.9 * std, 10000)
plt.plot(x, f(x))
plt.show()

# To check:
area = integrate.quad(f, x[0], x[-1])
print(area)  # the first element should be close to 1
Now, the interpolation method is not great for outliers: if a predicted value is extremely far from your distribution (more than 3 standard deviations), it won't work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in that time. I'm sure there are better ways to do it. If your data follow a normal law, it becomes trivial.
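If the normal assumption does hold, a minimal sketch of that trivial case (prediction_value is a hypothetical prediction you want to assess):

from scipy.stats import norm

mu, sigma = outputs.mean(), outputs.std()
density = norm.pdf(prediction_value, loc=mu, scale=sigma)      # PDF value at the prediction
p_two_sided = 2 * norm.sf(abs(prediction_value - mu) / sigma)  # two-sided tail probability
print(density, p_two_sided)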
I suggest you look into NGBoost (essentially a wrapper around XGBoost that ultimately provides a probabilistic model).
Here you can find slides on how NGBoost works, as well as the seminal NGBoost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (by default the Gaussian distribution) and fit an XGBoost model to estimate the best parameters of that distribution (for the Gaussian, $\mu$ and $\sigma$). The model splits the variables' space into different regions with different distributions, i.e. the same family (e.g. Gaussian) but different parameters.
After training the model, you are provided with the method pred_dist, which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$.
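A minimal sketch of NGBoost usage (the synthetic scikit-learn data is just an illustration; NGBRegressor defaults to a Normal distribution for $P(Y|X)$):

from ngboost import NGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ngb = NGBRegressor().fit(X_train, y_train)
point_preds = ngb.predict(X_test)  # point estimates (distribution means)
dists = ngb.pred_dist(X_test)      # full predictive distributions
print(dists.params['loc'][:5], dists.params['scale'][:5])  # per-sample mu and sigma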

Automatic Trend Detection for Time Series / Signal Processing

What are good algorithms to automatically detect a trend or draw a trend line (up trend, down trend, no trend) for time series data? I would appreciate it if you could point me to any good research paper or a good library in Python, R, or Matlab.
Ideally, the output from this algorithm will have 4 columns:
from_time
to_time
trend (up/down/no trend/unknown)
probability_of_trend or degree_of_trend
Thank you so much for your time.
I had a similar problem: I wanted to segment a time series into segments with similar trends. For that task, you can use the trend-classifier Python library. It is pip installable (pip3 install trend-classifier).
Here is an example that gets the time series data from YahooFinance and performs analysis.
import yfinance as yf
from trend_classifier import Segmenter
# download data from yahoo finance
df = yf.download("AAPL", start="2018-09-15", end="2022-09-05", interval="1d", progress=False)
x_in = list(range(0, len(df.index.tolist()), 1))
y_in = df["Adj Close"].tolist()
seg = Segmenter(x_in, y_in, n=20)
seg.calculate_segments()
Now, you can plot the time series with trend lines and segment boundaries with:
seg.plot_segments()
You can inspect details about each segment (e.g. a positive slope value indicates an up-trend and a negative one a down-trend). To see info about the segment with index 3:
from devtools import debug
debug(seg.segments[3])
You can get information about all segments in tabular form using the Segmenter.segments.to_dataframe() method, which produces a pandas DataFrame:
seg.segments.to_dataframe()
There is a parameter that controls the "generalization" factor: you can fit trend lines to smaller ranges of the time series and end up with a large number of segments, or go for segments spanning a bigger part of the series (a more general trend line) and end up with the time series divided into fewer segments. To control this behavior, use different values of the n parameter when initializing Segmenter() (e.g. Segmenter(x_in, y_in, n=20)). The larger n is, the stronger the generalization (fewer segments).
Disclaimer: I'm the author of the trend-classifier package.
