I used seaborn.regplot to plot my data, but I don't quite understand how the error bars in regplot are calculated. I compared the results with the mean and standard deviation derived from a manual calculation. Here is my testing script.
import numpy as np
import pandas as pd
import seaborn as sns

def get_data_XYE(p):
    # Recover x, y, and error-bar values from the Line2D objects of a regplot axes
    x_list = []
    lower_list = []
    upper_list = []
    for line in p.lines:
        x_list.append(line.get_xdata()[0])
        lower_list.append(line.get_ydata()[0])
        upper_list.append(line.get_ydata()[1])
    y = 0.5 * (np.asarray(lower_list) + np.asarray(upper_list))
    y_error = np.asarray(upper_list) - y
    x = np.asarray(x_list)
    return x, y, y_error
x = [37.3448,36.6026,42.7795,34.7072,75.4027,226.2615,192.7984,140.8045,242.9952,458.451,640.6542,726.1024,231.7347,107.5605,200.2254,190.0006,314.1349,146.8131,152.4497,175.9096,284.9926,116.9681,118.2953,312.3787,815.8389,458.0146,409.5797,595.5373,188.9955,15.7716,36.1839,244.8689,57.4579,94.8717,112.2237,87.0687,72.79,22.3457,24.1728,29.505,80.8765,252.7454,280.6002,252.9573,348.246,112.705,98.7545,317.0541,300.9573,402.8411,406.6884,56.1286,30.1385,32.9909,497.556,19.3606,20.8409,95.2324,108.6074,15.7753,54.5511,45.5623,64.564,101.1934,81.8459,88.286,58.2642,56.1225,51.2943,38.0649,63.5882,63.6847,120.495,102.4097,49.3255,111.3309,171.6028,58.9526,28.7698,144.6884,180.0661,116.6028,146.2594,199.8702,128.9378,423.2363,119.8537,124.6508,518.8625,306.3023,79.5213,121.0309,116.9346,170.8863,930.361,48.9983,55.039,47.1092,72.0548,75.4045,103.521,83.4134,142.3253,146.6215,121.4467,101.4252,68.4812,291.4275,143.9475,142.647,78.9826,47.094,204.2196,89.0208,82.792,27.1346,142.4764,83.7874,67.3216,112.9531,138.2549,133.3446,86.2659,45.3464,56.1604,43.5882,54.3623,86.296,115.7272,96.5498,111.8081,36.1756,40.2947,34.2532,89.1452,53.9062,36.458,113.9297,176.9962,77.3125,77.8891,64.807,64.1515,127.7242,119.6876,976.2324,322.8454,434.2883,168.6923,250.0284,234.7329,131.0793,152.335,118.8838,243.1772,24.1776,168.6327,170.7541,167.8444,75.9315,110.1045,113.4417,60.5464,66.8956,79.7606,71.6659,72.5251,77.513,207.8019,21.8592,35.2787,169.7698,146.5012,412.9934,248.0708,318.5489,104.1278,184.7592,108.0581,175.2646,169.7698,340.3732,570.3396,23.9853,69.0405,66.7391,67.9435,294.6085,68.0537,77.6344,433.2713,104.3178,229.4615,187.8587,78.1399,121.4737,122.5451,384.5935,38.5232,117.6835,50.3308,318.2513,103.6695,20.7181,321.9601,510.3248,13.4754,16.1188,44.8082,37.7291,733.4587,446.6241,21.1822,287.9603,327.2367,274.1109,195.4713,158.2114,64.4537,26.9857,172.8503]
y = [37,40,30,29,24,23,27,12,21,20,29,28,27,32,23,29,28,22,28,23,24,29,32,18,22,12,12,14,29,31,34,31,22,40,25,36,27,27,29,35,33,25,25,27,27,19,35,26,18,24,25,37,52,47,34,39,40,48,41,44,35,36,53,46,38,44,23,26,26,28,27,21,25,21,20,27,35,24,46,34,22,30,30,30,31,26,25,28,21,31,24,27,33,21,31,33,29,33,32,21,25,22,39,31,34,26,23,18,20,18,34,25,20,12,23,25,21,21,25,31,17,27,28,29,25,24,25,21,24,27,23,22,23,22,22,26,22,19,26,35,33,35,29,26,26,30,22,32,33,33,28,32,26,29,36,37,37,28,24,30,25,20,29,24,33,35,30,32,31,33,40,35,37,24,34,29,27,24,36,26,26,26,27,27,20,17,28,34,18,20,20,18,19,23,20,22,25,32,44,41,39,41,40,44,36,42,31,32,26,29,23,29,29,28,31,22,29,24,28,28,25]
xbreaks = [13.4754, 27.1346, 43.5882, 58.9526, 72.79, 89.1452, 110.1045, 131.0793, 158.2114, 180.0661, 207.8019, 234.7329, 252.9573, 300.9573, 327.2367, 348.246, 412.9934, 434.2883, 458.451, 518.8625, 595.5373, 640.6542, 733.4587, 815.8389, 930.361, 976.2324]
df = pd.DataFrame([x,y]).T
df.columns = ['x','y']
# Check the bin average and std using agg
bins = pd.cut(df.x,xbreaks,right=False)
t = df[['x','y']].groupby(bins).agg({"x": "mean", "y": ["mean","std"]})
t.reset_index(inplace=True)
t.columns = ['range_cut','x_avg_cut','y_avg_cut','y_std_cut']
t.index.name = 'id'
# Get the bin average and error bars from the regplot figure
g = sns.regplot(x='x', y='y', data=df, fit_reg=False, x_bins=xbreaks, seed=0)  # fixed seed for the bootstrap
xye = pd.DataFrame(get_data_XYE(g)).T
xye.columns = ['x_regplot','y_regplot','e_regplot']
xye.index.name = 'id'
t2 = xye.merge(t,on='id',how='left')
t2
You can see that the y and e values from the two approaches are different. I understand that the default x_ci or x_estimator may affect the result of regplot, but I still cannot reproduce these values in Excel by removing some of the lowest and/or highest values in each bin.
In seaborn.regplot, the x_bins values are the centers of the bins, and each original x value is assigned to the nearest bin center, whereas in pandas.cut the breaks define the bin edges. In addition, with the default x_ci='ci' the error bars are bootstrapped confidence intervals around each bin mean (95% by default), not mean ± standard deviation, so trimming extreme values in each bin will not reproduce them.
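A minimal sketch of the binning difference, reusing the df and xbreaks defined in the question (the bootstrap step that produces the error bars is left out):

import numpy as np
import pandas as pd

centers = np.asarray(xbreaks)
# regplot-style: snap each x to the nearest bin *center*
nearest = centers[np.abs(df['x'].values[:, None] - centers[None, :]).argmin(axis=1)]
means_regplot_style = df['y'].groupby(nearest).mean()
# pandas.cut-style: the same values treated as bin *edges*
means_cut_style = df['y'].groupby(pd.cut(df['x'], xbreaks, right=False)).mean()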
I am trying to model a time series in Python (python 2.7.11) using the excellent statsmodels.tsa package. My data consists of hourly measurements of traffic intensity over several weeks, so it has multiple seasonal components: days form a 24-hour period and weeks form a 168-hour period.
At this point, the modeling options in statsmodels.tsa are not set up to handle multiple seasonality, as they only allow the specification of a single seasonal factor. However, I came across the work of Rob Hyndman on multiple seasonality in R. He advocates modeling the seasonal components of a time series using Fourier series, including in the model a set of Fourier terms for the frequencies corresponding to each of the seasonal periods.
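For reference, Hyndman-style Fourier terms for the two periods can be built along these lines (a sketch under my own naming; the number of sine/cosine pairs K per period is an assumption, not something taken from his work):

import numpy as np
import pandas as pd

def fourier_terms(n_obs, period, K):
    # K sine/cosine pairs for one seasonal period of the given length
    t = np.arange(n_obs)
    cols = {}
    for k in range(1, K + 1):
        cols['sin_p%d_k%d' % (period, k)] = np.sin(2. * np.pi * k * t / period)
        cols['cos_p%d_k%d' % (period, k)] = np.cos(2. * np.pi * k * t / period)
    return pd.DataFrame(cols)

# e.g. daily plus weekly terms as exogenous regressors:
# exog = pd.concat([fourier_terms(1008, 24, 3), fourier_terms(1008, 168, 3)], axis=1)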
I've used Welch's method to obtain the power spectral density of the observed time series, extracted the peaks in the spectrum which correspond to the frequencies at which I expect my seasonal effects, and used the frequency and amplitude of each peak to generate a sine wave pattern corresponding to the seasonal trends I expect in my data. As an aside, I think this allows me to bypass Hyndman's step of selecting the value of k based on the AIC, because I am using the signal inherent in the observed data.
To ensure that the sine waves match the occurrence of the seasonal pattern in the data, I match the peak of both sine wave patterns to the peaks in the observed data by visually selecting a peak within one of the 24-hour periods, and matching the hour of its occurrence to the highest value of the variable representing the sine wave. Prior to this, I have checked that the daily peaks occur at the same hour consistently.
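In code, that peak-matching step amounts to something like the following (a self-contained sketch; the observed daily peak hour of 7 is an assumed example value):

import numpy as np

hours = np.arange(0, 1008)                   # six weeks of hourly samples
obs_peak_hour = 7                            # assumed: observed daily peak at 07:00
sine_day = np.sin(2. * np.pi * hours / 24.)  # unaligned daily component
peak_sin = int(np.argmax(sine_day[:24]))     # hour at which the sine wave peaks
shift = (peak_sin - obs_peak_hour) % 24
aligned = np.roll(sine_day, -shift)          # peaks now fall on the observed hour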
So far, so good, it seems: plots of the sine waves constructed with the obtained frequencies and amplitudes roughly correspond to the observed data. I then fit an ARIMA(2,0,0) model, including both of the decomposition-based variables as exogenous variables. At this point, I want to test the predictive utility of the model. However, this is where things get complicated.
When I use ARIMA from the statsmodels package, the estimates I get from fitting the model form a pattern which replicates the sine waves, but with a range of values matching my observations. There is still a lot of variance in the observations which is not explained by the seasonal trends, leading me to believe that somewhere in the model-fitting procedure something is not going the way it is supposed to.
Unfortunately, I am not sufficiently well versed in the art of time series modeling to know whether my unexpected results are due to the nature of the exogenous variables I am including, to statsmodels functionality that I should be using but am omitting, or to wrongful assumptions about the concept of seasonal trends.
Some concrete questions I have are:
is it possible to include multiple seasonal trends (i.e. Fourier- or decomposition-based) in an ARIMA model using statsmodels in python?
could reconstruction of the seasonal trend using sine waves cause difficulties when the sine waves are included as exogenous variables in the model as specified above and in the code below?
why does the model specified in the code below not yield predictions which match the observed data more closely?
Any help is much appreciated!
Best wishes, and thanks in advance,
Evert
p.s.: Sorry if my code sample and data file are overly long - as I am not sure what causes the unexpected results I thought I'd post the whole thing. Also, apologies for not following PEP8 at times - I'm still learning :)
Code sample:
import os
import re
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.signal import welch
import operator
# Function which plots rolling mean of data set in order to estimate stationarity
# 'timeseries' = Data to be used for ARIMA modeling
#
def plotmean(timeseries, show=0, path=''):
    rolmean = pd.rolling_mean(timeseries, window=12)
    rolstd = pd.rolling_std(timeseries, window=12)
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue', label='Observed scores')
    mean = plt.plot(rolmean, color='red', label='Rolling mean')
    std = plt.plot(rolstd, color='black', label='Rolling SD')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
#
# Function to decompose a function over time f(t) into a spectrum of signal amplitude and frequency
# 'dta' = The dataset used
# 'show' = Whether or not to show plot
# 'path' = Where to store plot, if desirable
#
# Output:
# frequency range and spectral density range
#
def runwelch(dta, show, path):
    nps = (len(dta) / 2) + 8
    nov = nps / 2
    fft = nps
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    f, Pxx_den = welch(dta, fs=fs_temp, nperseg=nps, noverlap=nov, nfft=fft, scaling="spectrum")
    plt.plot(f, Pxx_den)
    plt.ylim([0.5e-7, 10])
    plt.xlabel('frequency [Hz]')
    plt.ylabel('PSD [V**2/Hz]')
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return f, Pxx_den
#
# Function which gets the amplitude and frequency of the n most important periodic cycles, and provides a plot
# to visually inspect whether they correspond to the expected seasonal components.
# 'n_obs' = Number of observations in the time series
# 'freq' = output of Welch decomposition
# 'density' = output of Welch decomposition
# 'n' = desired number of peaks to extract
# 'show' = whether to show plots of the corresponding sine functions
def getsines(n_obs, freq, density, n, show):
    ftemp = freq
    dtemp = np.copy(density)  # work on a copy so the caller's density array is not zeroed out
    fstore = []
    dstore = []
    astore = []
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    samplespace = n_obs * 3600
    for a in range(0, n, 1):
        max_index, max_value = max(enumerate(dtemp), key=operator.itemgetter(1))
        dstore.append(max_value)
        fstore.append(ftemp[max_index])
        astore.append(np.sqrt(max_value))
        dtemp[max_index] = 0
    if show == 1:
        for b in range(0, len(fstore), 1):
            sound_sine = sine(fstore[b], samplespace, fs_temp, astore[b])
            plt.plot(sound_sine)
            plt.show()
            plt.clf()
    return fstore, astore
def sine(freq, time_interval, rate, amp):
    w = 2. * np.pi * freq
    t = np.linspace(0, time_interval, int(time_interval * rate))
    y = amp * np.sin(w * t)
    return y
#
# Function which builds the exogenous regressors from the extracted sine waves and aligns their phase with the data
# 'dta' = Data set
# 'fstore' = Frequencies of the extracted spectral peaks
# 'astore' = Amplitudes of the extracted spectral peaks
def buildFterms(dta, fstore, astore):
    n = len(fstore)
    n_obs = len(dta)
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    samplespace = n_obs * 3600 + (24 * 3600)
    # Add one excess day for later fitting of sine waves to peaks
    store = []
    for i in range(0, n, 1):
        tmp = sine(fstore[i], samplespace, 0.0002778, astore[i])
        store.append(tmp)
    k_168_store = store[0]
    k_24_store = store[1]
    k_24 = np.transpose(k_24_store)
    k_168 = np.transpose(k_168_store)
    k_24 = pd.Series(k_24)
    k_168 = pd.Series(k_168)
    dta_ind, dta_val = max(enumerate(dta.iloc[120:143]), key=operator.itemgetter(1))
    # Visually inspect the mean plot, select an interval with a clear and representative peak, and use it to determine the index.
    k_24_ind, k_24_val = max(enumerate(k_24.iloc[0:23]), key=operator.itemgetter(1))
    # The peak in sound level at index 1 is matched by a peak in the sine wave at index 7,
    # so sound level[0] corresponds to sine wave[6].
    # print dta_ind, dta_val, k_24_ind, k_24_val
    k_24_sel = k_24[6:1014]
    k_168_sel = k_168[6:1014]
    exog = pd.concat([k_24_sel, k_168_sel], axis=1)
    return exog
#
# Function which takes data, makes a plot of the ACF and PACF, and saves the plot, if needed
# 'x' = Time series data, time indexed, over which to plot the ACF and PACF.
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
# Use output plot to visually interpret necessary parameters p, d, q, and seasonal component for SARIMAX procedure
#
def plotpacf(x, show=0, path=''):
    dflength = len(x)
    nlags = int(dflength * .80)  # lags must be an integer
    fig = plt.figure(figsize=(12, 8))
    ax1 = fig.add_subplot(211)
    fig = sm.graphics.tsa.plot_acf(x.squeeze(), lags=nlags, ax=ax1)
    ax2 = fig.add_subplot(212)
    fig = sm.graphics.tsa.plot_pacf(x, lags=nlags, ax=ax2)
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
#
# Function to calculate the Dickey-Fuller test of stationarity
# 'dta' = Time series data, time indexed, over which to test for stationarity using the Dickey-Fuller test.
#
def dftest(dta):
    print 'Results of Dickey-Fuller Test:'
    dftest = sm.tsa.stattools.adfuller(dta, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    if dfoutput[0] < dfoutput[4]:
        dfoutput['Stationary'] = 'True'
    else:
        dfoutput['Stationary'] = 'False'
    print dfoutput
#
# Function to difference the time series, in order to determine optimal value of d for ACF and PACF
# 'dta' = Data, time series indexed, to be differenced
# 'd' = Order of differencing to be applied
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#
def diffit(dta, d, show, path=''):
    templist = []
    for i in range(0, (len(dta) - d), 1):
        tempval = dta[i] - dta[i + d]
        templist.append(tempval)
    y = templist[d:len(templist)]
    y = pd.Series(y)
    plotpacf(y, show, path)
    return y
#
# Function to fit the ARIMA model based on parameters obtained from the ACF / PACF plot
# 'dta' = Time series data, time indexed, over which to fit a SARIMAX model.
# 'exog' = Exogenous variables used in ARIMA model
# 'p' = Number of AutoRegressive lags, initially based on the cutoff point of the PACF
# 'd' = Order of differencing based on visual examination of ACF and PACF plots
# 'q' = Number of Moving Average lags, initially based on the cutoff point of the ACF
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#
def runARIMA(dta, exogvar, p, d, q, show=0, path=''):
    mod = sm.tsa.ARIMA(dta, (p, d, q), exogvar)
    results = mod.fit()
    resids = results.resid.values
    summarised = results.summary()
    print summarised
    plotpacf(resids, show, path)
    return results
#
# Function to use fitted ARIMA for prediction of observed data, compare predicted to observed
# 'dta' = Data used in ARIMA prediction
# 'exog' = Exogenous variables fitted in the model
# 'arima' = Result from correctly fitted ARIMA model, likely on the residuals of a decomposed time series
# 'datrng' = Range of dates used for original time series definition, used for specifying predictions
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#
def ARIMAcompare(dta, exogvar, arima, datrng, show=0, path=''):
    dflength = len(datrng) - 1
    observation = dta
    prediction = arima.predict(start=3, end=dflength, exog=exogvar, dynamic=True)
    df = pd.concat([prediction, observation], axis=1)
    df.columns = ['predicted', 'observed']
    plt.plot(prediction)
    plt.plot(observation)
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return df
#
# Function which uses the fitted ARIMA model for out-of-sample prediction
# 'pred_hours' = Number of hours we want to predict scores for
# 'k' = Exogenous variables covering the prediction period
# 'df' = Data frame containing the data on which the ARIMA model was previously fitted
# 'arima' = Result of the correctly fitted ARIMA model
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#
# Output: A dataframe with observed and predicted values. Note that predictions > 5 time units are considered unreliable
# by modeling standards.
#
def pred(pred_hours, k, df, arima, show=0, path=''):
    n_obs = len(df.index)
    lastdt = df.index[n_obs - 1]
    lastdt = lastdt.to_datetime()
    datrng = pd.date_range(lastdt, periods=(pred_hours + 1), freq='H')
    future = pd.DataFrame(index=datrng, columns=df.columns)
    df = pd.concat([df, future])
    lendf = len(df.index)
    df['predicted'] = arima.predict(start=n_obs, end=lendf, exog=k, dynamic=True)
    print df
    marked = 2 * pred_hours
    df[['predicted', 'observed']].ix[-marked:].plot(figsize=(12, 8))
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return df[['predicted', 'observed']].ix[-marked:]
dirnow = os.getcwd()
fpath = dirnow + '/sounds_full2.csv'
fhand = open(fpath)
dta = pd.read_csv(fhand, sep=',')
dta_sel = dta.iloc[1248:2256, 2]
#
#
#
# Extract start and end date of measurements from sound data, adding one hour because
# the last hour of the last day is not counted
#
sound_start = dta.iloc[1248, 0]
# The above .iloc value needs to be changed depending on the length of the sound data set being read in.
#
# Establish start date
sound_start = re.sub('-', '/', sound_start)
sound_start = re.sub('_', ' ', sound_start)
sound_start = sound_start + ':00'
sound_start = pd.to_datetime(sound_start, format='%d/%m/%Y %H:%M:%S')
#
# Establish end date
indexer = len(dta.index) - 1
sound_end = dta.iloc[indexer, 0]
sound_end = re.sub('-', '/', sound_end)
sound_end = re.sub('_', ' ', sound_end)
sound_end = sound_end + ':00'
sound_end = pd.to_datetime(sound_end, format='%d/%m/%Y %H:%M:%S')
sound_diff = sound_end - sound_start
#
# Derive number of periods and create data set
num_observed = (sound_diff.days * 24) + ((sound_diff.seconds + 3600) / 3600)
usedates3 = pd.date_range(sound_start, periods=num_observed, freq='H')
usedates3 = pd.Series(usedates3)
usedates3.index = dta_sel.index
timedfreq = pd.concat([usedates3, dta_sel], axis=1)
timedfreq.index = timedfreq.iloc[:, 0]
freqset = pd.Series(timedfreq.iloc[:, 1])
filepath = dirnow + '/Sound_RollingMean.png'
plotmean(freqset, 0, filepath)
# The plotted mean shows recurring (seasonal) trends with periods of 24 hours and 168 hours.
# This means a seasonal model is needed that accounts for both of these influences.
# To do so, Fourier series representing the 24- and 168-hour seasonal trends can be added to the ARIMA model.
#
#
#
#
# Check for stationarity of data
#
dftest(freqset)
# Time series can be considered stationary
#
#
#
# Establish frequencies and amplitudes with which to fit ARIMA model
#
# Decompose signal into frequency and amplitude
#
filepath = dirnow + "/Welch.png"
f, Pxx_den = runwelch(freqset, 0, filepath)
#
# Obtain sine wave parameters, optionally view test plots to check periodicity
freqs, amplitudes = getsines(len(freqset), f, Pxx_den, 2, 0)
#
# Use parameters to build Fourier series for observed data with varying values for k
exog_sel = buildFterms(freqset, freqs, amplitudes)
exog_sel.index = freqset.index
#
# fit ARIMA model, plot ACF and PACF for fitted model, check for effects orders of differencing on residuals
#
filepath = dirnow + '/Sound_resid_ACFPACF.png'
Sound_ARIMA = runARIMA(freqset, exog_sel, 1, 0, 0, show=0, path=filepath)
sound_residuals = Sound_ARIMA.resid
#
# Plot various acf / pacf plots of differencing given model residuals
filepath = dirnow + '/Sound_resid_ACFPACF_d1.png'
tempdta_d1 = diffit(sound_residuals, 1, 0, filepath)
filepath = dirnow + '/Sound_resid_ACFPACF_d2.png'
tempdta_d2 = diffit(sound_residuals, 2, 0, filepath)
# Of the two differenced models, one order of differencing seems to yield the best results
# Visual inspection of plots and model output suggests model with p = 2, d = 0 or p = 1, d = 1 to be optimal.
#
#
#
# Find optimal form of model
filepath = dirnow + '/Sound_resid_ACFPACF_200.png'
Sound_ARIMA_200 = runARIMA(freqset, exog_sel, 2, 0, 0, show=0, path=filepath)
filepath = dirnow + '/Sound_resid_ACFPACF_110.png'
Sound_ARIMA_110 = runARIMA(freqset, exog_sel, 1, 1, 0, show=0, path=filepath)
# Based on model output and ACF / PACF plot comparison for 'Sound_resid_ACFPACF_110.png' and \
# 'Sound_resid_ACFPACF_200.png', the model parameters for p = 2, d = 0, q = 0 are closer to optimal.
#
# Use selected model to predict observed values
filepath = dirnow + '/Sound_PredictObserved.png'
sound_comparison = ARIMAcompare(freqset, exog_sel, Sound_ARIMA_200, usedates3, 0, filepath)
#
# Predict values and store for Sound dataset
filepath = dirnow + '/Sound_PredictFuture.png'
sound_storepred = pred(168, exog_sel.iloc[0:170, :], sound_comparison, Sound_ARIMA_200, 0, filepath)
Data file