I am trying to model a time series in Python (2.7.11) using the excellent statsmodels.tsa package. My data consists of hourly measurements of traffic intensity over several weeks. The data therefore has multiple seasonal components: days form a 24-hour period and weeks form a 168-hour period.
At this point, the modeling options in statsmodels.tsa are not set up to handle multiple seasonality, as they only allow one seasonal factor to be specified. However, I came across the work of Rob Hyndman on multiple seasonality in R. He advocates modeling the seasonal components of a time series with Fourier series, including one Fourier series in the model for the frequency corresponding to each of the seasonal periods.
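For illustration only, here is a minimal sketch of how Fourier terms for two seasonal periods could be built as regressors. This is not Hyndman's code; `fourier_terms`, `K` and the length 1008 (six weeks of hourly data) are names and values assumed for the example. Each harmonic contributes both a sine and a cosine column, so that the regression coefficients can absorb amplitude and phase.

```python
import numpy as np

def fourier_terms(n_obs, period, K):
    """Return an (n_obs, 2*K) array of sin/cos regressors for one seasonal period."""
    t = np.arange(n_obs)
    cols = []
    for k in range(1, K + 1):
        angle = 2.0 * np.pi * k * t / period
        cols.append(np.sin(angle))  # k-th sine harmonic
        cols.append(np.cos(angle))  # k-th cosine harmonic
    return np.column_stack(cols)

n_obs = 1008  # assumed: six weeks of hourly observations
# Two harmonics for the daily cycle and two for the weekly cycle
exog = np.hstack([fourier_terms(n_obs, 24, K=2), fourier_terms(n_obs, 168, K=2)])
```

The resulting array could then be passed as the exogenous regressors of the model. Note that the 7th harmonic of the 168-hour period coincides with the 24-hour period, so K for the weekly terms should stay below 7 when daily terms are also present.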
I've used Welch's method to obtain the power spectral density of the signal in my observed time series, extracted the peaks that correspond to the frequencies at which I expect my seasonal effects, and used the frequency and amplitude of each peak to generate a sine wave corresponding to the seasonal trends I expect in my data. As an aside, I think this allows me to bypass Hyndman's step of selecting the value of k based on the AIC, because I am using the signal inherent in the observed data.
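A small sketch of the peak-extraction step on synthetic data may make this concrete. The signal, the segment length of 672 and the use of `find_peaks` are assumptions for the demo, not the code from this post. Note that Welch's method returns power per frequency but no phase information.

```python
import numpy as np
from scipy.signal import welch, find_peaks

hours = np.arange(24 * 7 * 8)  # eight weeks of hourly samples
signal = (3.0 * np.sin(2 * np.pi * hours / 24)
          + 1.5 * np.sin(2 * np.pi * hours / 168))

# fs=1.0 means one sample per hour, so frequencies are in cycles per hour
freqs, power = welch(signal, fs=1.0, nperseg=672)

# Keep the two strongest local maxima of the spectrum
peak_idx, _ = find_peaks(power)
top_two = peak_idx[np.argsort(power[peak_idx])[-2:]]
periods = np.sort(1.0 / freqs[top_two])  # periods in hours
```

With this synthetic input the two recovered periods are the daily and weekly cycles; on real data the peaks are broader and the segment length limits how well a 168-hour cycle can be resolved.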
To ensure that the sine waves match the occurrence of the seasonal pattern in the data, I match the peak of both sine wave patterns to the peaks in the observed data by visually selecting a peak within one of the 24-hour periods, and matching the hour of its occurrence to the highest value of the variable representing the sine wave. Prior to this, I have checked that the daily peaks occur at the same hour consistently.
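As an alternative to aligning peaks by eye, the phase can be estimated by regressing the data on a sine and a cosine of the same frequency. The sketch below uses synthetic data and assumed names (`true_amp`, `true_phase`); it relies on the identity A*sin(wt + phi) = a*sin(wt) + b*cos(wt) with a = A*cos(phi) and b = A*sin(phi).

```python
import numpy as np

t = np.arange(24 * 14)                     # two weeks of hourly samples
true_amp, true_phase = 2.0, 0.7            # synthetic values for the demo
y = 10.0 + true_amp * np.sin(2 * np.pi * t / 24 + true_phase)

# Design matrix: intercept, sine and cosine at the daily frequency
X = np.column_stack([np.ones(t.size),
                     np.sin(2 * np.pi * t / 24),
                     np.cos(2 * np.pi * t / 24)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
_, a, b = coef

amplitude = np.hypot(a, b)   # A = sqrt(a^2 + b^2)
phase = np.arctan2(b, a)     # phi = atan2(A*sin(phi), A*cos(phi))
```

Supplying both columns as exogenous variables would let the model fit do this alignment itself, instead of fixing amplitude and phase beforehand.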
So far, so good it seems - plots of the sine waves constructed with the obtained frequencies and amplitudes roughly correspond to the observed data. I then fit an ARIMA(2,0,0) model, including both of the decomposition-based variables as exogenous variables. At this point, I want to test the predictive utility of the model. However, this is where things get complicated.
When I am using ARIMA from the statsmodels package, the estimates I get from fitting the model form a pattern which replicates the sine waves, but with a range of values matching my observation. There is still a lot of variance in the observations which is not explained by the seasonal trends, leading me to believe that somewhere in the model fitting procedure something is not going the way it is supposed to.
Unfortunately, I am not sufficiently well-versed in the art of time series modeling to know if my unexpected results are due to the nature of exogenous variables I am including, statsmodels functionality that I should be using, but am omitting, or wrongful assumptions about the concept of seasonal trends.
Some concrete questions I have are:
Is it possible to include multiple seasonal trends (i.e. Fourier- or decomposition-based) in an ARIMA model using statsmodels in Python?
Could reconstructing the seasonal trend with sine waves cause difficulties when the sine waves are included as exogenous variables in the model, as specified above and in the code below?
Why does the model specified in the code below not yield predictions that match the observed data more closely?
Any help is much appreciated!
Best wishes, and thanks in advance,
p.s.: Sorry if my code sample and data file are overly long - as I am not sure what causes the unexpected results I thought I'd post the whole thing. Also, apologies for not following PEP8 at times - I'm still learning :)
Code sample:
import os
import re
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.signal import welch
import operator
# Function which plots rolling mean of data set in order to estimate stationarity
# 'timeseries' = Data to be used for ARIMA modeling
def plotmean(timeseries, show=0, path=''):
rolmean = pd.rolling_mean(timeseries, window=12)
rolstd = pd.rolling_std(timeseries, window=12)
fig = plt.figure(figsize=(12, 8))
orig = plt.plot(timeseries, color='blue', label='Observed scores')
mean = plt.plot(rolmean, color='red', label='Rolling mean')
std = plt.plot(rolstd, color='black', label='Rolling SD')
plt.title('Rolling Mean & Standard Deviation')
if show != 0:
if path != '':
plt.savefig(path, format='png', bbox_inches='tight')
# Function to decompose a function over time f(t) into a spectrum of signal amplitude and frequency
# 'dta' = The dataset used
# 'show' = Whether or not to show plot
# 'path' = Where to store plot, if desirable
# Output:
# frequency range and spectral density range
def runwelch(dta, show, path):
nps = (len(dta) / 2) + 8
nov = nps / 2
fft = nps
fs_temp = .0002778
# Set to 1/3600 because of hourly sampling
f, Pxx_den = welch(dta, fs=fs_temp, nperseg=nps, noverlap=nov, nfft=fft, scaling="spectrum")
plt.plot(f, Pxx_den)
plt.ylim([0.5e-7, 10])
plt.xlabel('frequency [Hz]')
plt.ylabel('PSD [V**2/Hz]')
if show != 0:
if path != '':
plt.savefig(path, format='png', bbox_inches='tight')
return f, Pxx_den
# Function which gets amplitude and frequency of n most important periodical cycles, and provides plot
# to visually inspect if they correspond to expected seasonal components.
# 'freq' = output of Welch decomposition
# 'density' = output of Welch decomposition
# 'n' = desired number of peaks to extract
# 'show' = whether to show plots of corresponding sine functions
def getsines(n_obs, freq, density, n, show):
ftemp = freq
dtemp = density
fstore = []
dstore = []
astore = []
fs_temp = .0002778
# Set to 1/3600 because of hourly sampling
samplespace = n_obs * 3600
for a in range(0, n, 1):
max_index, max_value = max(enumerate(dtemp), key=operator.itemgetter(1))
dtemp[max_index] = 0
if show == 1:
for b in range(0, len(fstore), 1):
sound_sine = sine(fstore[b], samplespace, fs_temp, astore[b], 1)
return fstore, astore
def sine(freq, time_interval, rate, amp):
w = 2. * np.pi * freq
t = np.linspace(0, time_interval, time_interval * rate)
y = amp * np.sin(w * t)
return y
# Function which adapts the calculated sine waves for the returned sines for k = 1 through k = kmax
# 'dta' = Data set
def buildFterms(dta, fstore, astore):
n = len(fstore)
n_obs = len(dta)
fs_temp = .0002778
# Set to 1/3600 because of hourly sampling
samplespace = n_obs * 3600 + (24 * 3600)
# Add one excess day for later fitting of sine waves to peaks
store = []
for i in range(0, n, 1):
tmp = sine(fstore[i], samplespace, 0.0002778, astore[i])
k_168_store = store[0]
k_24_store = store[1]
k_24 = np.transpose(k_24_store)
k_168 = np.transpose(k_168_store)
k_24 = pd.Series(k_24)
k_168 = pd.Series(k_168)
dta_ind, dta_val = max(enumerate(dta.iloc[120:143]), key=operator.itemgetter(1))
# Visually inspect mean plot, select interval which has clear and representative peak, use to determine index.
k_24_ind, k_24_val = max(enumerate(k_24.iloc[0:23]), key=operator.itemgetter(1))
# peak in sound level at index 1 is matched by peak in sine wave at index 7. Thus, sound level[0] corresponds to\
# sine waves[6]
# print dta_ind, dta_val, k_24_ind, k_24_val
k_24_sel = k_24[6:1014]
k_168_sel = k_168[6:1014]
exog = pd.concat([k_24_sel, k_168_sel], axis=1)
return exog
# Function which takes data, makes a plot of the ACF and PACF, and saves the plot, if needed
# 'x' = Time series data, time indexed, over which to plot the ACF and PACF.
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
# Use output plot to visually interpret necessary parameters p, d, q, and seasonal component for SARIMAX procedure
def plotpacf(x, show=0, path=''):
dflength = len(x)
nlags = dflength * .80
fig = plt.figure(figsize=(12, 8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(x.squeeze(), lags=nlags, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(x, lags=nlags, ax=ax2)
if show != 0:
if path != '':
plt.savefig(path, format='png', bbox_inches='tight')
# Function to calculate the Dickey-Fuller test of stationarity
# 'dta' = Time series data, time indexed, over which to test for stationarity using the Dickey-Fuller test.
def dftest(dta):
    print 'Results of Dickey-Fuller Test:'
    dftest = sm.tsa.stattools.adfuller(dta, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    # Compare the test statistic with the first stored critical value
    if dfoutput[0] < dfoutput[4]:
        dfoutput['Stationary'] = 'True'
    else:
        dfoutput['Stationary'] = 'False'
    print dfoutput
# Function to difference the time series, in order to determine optimal value of d for ACF and PACF
# 'dta' = Data, time series indexed, to be differenced
# 'd' = Order of differencing to be applied
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
def diffit(dta, d, show, path=''):
templist = []
for i in range(0, (len(dta) - d), 1):
tempval = dta[i] - dta[i + d]
y = templist[d:len(templist)]
y = pd.Series(y)
plotpacf(y, show, path)
return y
# Function to fit the ARIMA model based on parameters obtained from the ACF / PACF plot
# 'dta' = Time series data, time indexed, over which to fit a SARIMAX model.
# 'exog' = Exogenous variables used in ARIMA model
# 'p' = Number of AutoRegressive lags, initially based on the cutoff point of the ACF
# 'd' = Order of differencing based on visual examination of ACF and PACF plots
# 'q' = Number of Moving Average lags, initially based on the cutoff point of the PACF
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
def runARIMA(dta, exogvar, p, d, q, show=0, path=''):
mod = sm.tsa.ARIMA(dta, (p, d, q), exogvar)
results = mod.fit()
resids = results.resid.values
summarised = results.summary()
print summarised
plotpacf(resids, show, path)
return results
# Function to use fitted ARIMA for prediction of observed data, compare predicted to observed
# 'dta' = Data used in ARIMA prediction
# 'exog' = Exogenous variables fitted in the model
# 'arima' = Result from correctly fitted ARIMA model, likely on the residuals of a decomposed time series
# 'datrng' = Range of dates used for original time series definition, used for specifying predictions
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
def ARIMAcompare(dta, exogvar, arima, datrng, show=0, path=''):
dflength = len(datrng) - 1
observation = dta
prediction = arima.predict(start=3, end=dflength, exog=exogvar, dynamic=True)
df = pd.concat([prediction, observation], axis=1)
df.columns = ['predicted', 'observed']
if show != 0:
if path != '':
plt.savefig(path, format='png', bbox_inches='tight')
return df
# Function use fitted ARIMA model for predictions
# 'pred_hours' = number of hours we want to predict scores for
# 'firsttime' = last timestamp in observations
# 'df' = data frame containing data on which the ARIMA model was previously fitted
# 'results' = output of the modeling procedure
# 'freq' = Frequency of seasonal cycle that was used in decomposition
# 'decomp' = Output of the time series decomposition step
# 'mark' = Amount of hours included in the graph prior to prediction. Set at as close to 2 weeks as possible.
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
# Output: A dataframe with observed and predicted values. Note that predictions > 5 time units are considered unreliable
# by modeling standards.
def pred(pred_hours, k, df, arima, show=0, path=''):
n_obs = len(df.index)
lastdt = df.index[n_obs - 1]
lastdt = lastdt.to_datetime()
datrng = pd.date_range(lastdt, periods=(pred_hours + 1), freq='H')
future = pd.DataFrame(index=datrng, columns=df.columns)
df = pd.concat([df, future])
lendf = len(df.index)
df['predicted'] = arima.predict(start=n_obs, end=lendf, exog=k, dynamic=True)
print df
marked = 2 * pred_hours
df[['predicted', 'observed']].ix[-marked:].plot(figsize=(12, 8))
if show != 0:
if path != '':
plt.savefig(path, format='png', bbox_inches='tight')
return df[['predicted', 'observed']].ix[-marked:]
dirnow = os.getcwd()
fpath = dirnow + '/sounds_full2.csv'
fhand = open(fpath)
dta = pd.read_csv(fhand, sep=',')
dta_sel = dta.iloc[1248:2256, 2]
# Extract start and end date of measurements from sound data, adding one hour because
# the last hour of the last day is not counted
sound_start = dta.iloc[1248, 0]
# The above .iloc value needs to be changed depending on the length of the sound data set being read in.
# Establish start date
sound_start = re.sub('-', '/', sound_start)
sound_start = re.sub('_', ' ', sound_start)
sound_start = sound_start + ':00'
sound_start = pd.to_datetime(sound_start, format='%d/%m/%Y %H:%M:%S')
# Establish end date
indexer = len(dta.index) - 1
sound_end = dta.iloc[indexer, 0]
sound_end = re.sub('-', '/', sound_end)
sound_end = re.sub('_', ' ', sound_end)
sound_end = sound_end + ':00'
sound_end = pd.to_datetime(sound_end, format='%d/%m/%Y %H:%M:%S')
sound_diff = sound_end - sound_start
# Derive number of periods and create data set
num_observed = (sound_diff.days * 24) + ((sound_diff.seconds + 3600) / 3600)
usedates3 = pd.date_range(sound_start, periods=num_observed, freq='H')
usedates3 = pd.Series(usedates3)
usedates3.index = dta_sel.index
timedfreq = pd.concat([usedates3, dta_sel], axis=1)
timedfreq.index = timedfreq.iloc[:, 0]
freqset = pd.Series(timedfreq.iloc[:, 1])
filepath = dirnow + '/Sound_RollingMean.png'
plotmean(freqset, 0, filepath)
# Plotted mean shows recurring (seasonal) trends at periods of 24 hours and 168 hours.
# This means a seasonal model is needed that accounts for both of these influences
# To do so, Fourier series representing the 24- and 168 hour seasonal trends can be added to the ARIMA-model
# Check for stationarity of data
# Time series can be considered stationary
# Establish frequencies and amplitudes with which to fit ARIMA model
# Decompose signal into frequency and amplitude
filepath = dirnow + "/Welch.png"
f, Pxx_den = runwelch(freqset, 0, filepath)
# Obtain sine wave parameters, optionally view test plots to check periodicity
freqs, amplitudes = getsines(len(freqset), f, Pxx_den, 2, 0)
# Use parameters to build Fourier series for observed data with varying values for k
exog_sel = buildFterms(freqset, freqs, amplitudes)
exog_sel.index = freqset.index
# fit ARIMA model, plot ACF and PACF for fitted model, check for effects orders of differencing on residuals
filepath = dirnow + '/Sound_resid_ACFPACF.png'
Sound_ARIMA = runARIMA(freqset, exog_sel, 1, 0, 0, show=0, path=filepath)
sound_residuals = Sound_ARIMA.resid
# Plot various acf / pacf plots of differencing given model residuals
filepath = dirnow + '/Sound_resid_ACFPACF_d1.png'
tempdta_d1 = diffit(sound_residuals, 1, 0, filepath)
filepath = dirnow + '/Sound_resid_ACFPACF_d2.png'
tempdta_d2 = diffit(sound_residuals, 2, 0, filepath)
# Of the two differenced models, one order of differencing seems to yield the best results
# Visual inspection of plots and model output suggests model with p = 2, d = 0 or p = 1, d = 1 to be optimal.
# Find optimal form of model
filepath = dirnow + '/Sound_resid_ACFPACF_200.png'
Sound_ARIMA_200 = runARIMA(freqset, exog_sel, 2, 0, 0, show=0, path=filepath)
filepath = dirnow + '/Sound_resid_ACFPACF_110.png'
Sound_ARIMA_110 = runARIMA(freqset, exog_sel, 1, 1, 0, show=0, path=filepath)
# Based on model output and ACF / PACF plot comparison for 'Sound_resid_ACFPACF_110.png' and \
# 'Sound_resid_ACFPACF_200.png', the model parameters for p = 2, d = 0, q = 0 are closer to optimal.
# Use selected model to predict observed values
filepath = dirnow + '/Sound_PredictObserved.png'
sound_comparison = ARIMAcompare(freqset, exog_sel, Sound_ARIMA_200, usedates3, 0, filepath)
# Predict values and store for Sound dataset
filepath = dirnow + '/Sound_PredictFuture.png'
sound_storepred = pred(168, exog_sel.iloc[0:170, :], sound_comparison, Sound_ARIMA_200, 0, filepath)
Data file
I am writing code to remove plateau outliers from time series data. Following advice I received, I tried np.diff, but it fails to recognize a plateau when the values are not exactly identical.
def find_plateaus(F, min_length=200, tolerance = 0.75, smoothing=15):
import numpy as np
from scipy.ndimage import uniform_filter1d
# calculate smooth gradients
smoothF = uniform_filter1d(F, size = smoothing)
dF = uniform_filter1d(np.gradient(smoothF),size = smoothing)
d2F = uniform_filter1d(np.gradient(dF),size = smoothing)
def zero_runs(x):
iszero = np.concatenate(([0], np.equal(x, 0).view(np.int8), [0]))
absdiff = np.abs(np.diff(iszero))
ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
return ranges
# Find ranges where second derivative is zero
# Values under eps are assumed to be zero.
eps = np.quantile(abs(d2F),tolerance)
smalld2F = (abs(d2F) <= eps)
# Find repetitions in the mask "smalld2F" (i.e. ranges where d2F is constantly zero)
p = zero_runs(np.diff(smalld2F))
# np.diff(p) gives the length of each range found.
# only accept plateaus of min_length
plateaus = p[(np.diff(p) > min_length).flatten()]
return (plateaus)
plateaus = find_plateaus(test, min_length=5, tolerance = 0.02, smoothing=11)
plateaus = np.ravel(plateaus, order = 'A')
plateaus = plateaus.tolist()
test2['T&F'] = np.nan
for i in test2.index:
if i in plateaus:
test2.loc[i,['T&F']] = test2.loc[i,'data']
else :
test2.loc[i,['T&F']] = 0
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(test2.index, test2['data'], color='black', label = 'time_series')
ax.scatter(test2.index,test2['T&F'], color='red', label = 'D910')
Do you know any libraries or methods that can be used?
I want to recognize the parts marked in the picture below.
Still in progress, but I found an answer.
First, make the NumPy array multidimensional by splitting it into windows of length time_step.
e.g. time_step = 3
Then use np.std() to compute the standard deviation of each window.
After inspecting the results, you can set a standard deviation range and recognize the windows that fall within it as plateaus.
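The steps above could be sketched roughly as follows. The helper name `plateau_mask`, the threshold `max_std` and the sample values are assumptions for illustration, and overlapping windows are used here, which is one possible reading of the description.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def plateau_mask(values, time_step=3, max_std=0.05):
    """Mark samples that belong to at least one low-variance window."""
    values = np.asarray(values, dtype=float)
    # 2-D view of overlapping windows, shape (n - time_step + 1, time_step)
    windows = sliding_window_view(values, time_step)
    flat = windows.std(axis=1) <= max_std
    mask = np.zeros(values.size, dtype=bool)
    for start in np.flatnonzero(flat):
        mask[start:start + time_step] = True  # flag every sample in the window
    return mask

series = np.array([1.0, 2.0, 4.0, 4.01, 3.99, 4.0, 7.0, 9.0])
mask = plateau_mask(series, time_step=3, max_std=0.05)
```

Here the four nearly equal samples in the middle are flagged even though their values are not identical, which is the case np.diff with an exact-zero test misses.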
I'm currently working on a trading strategy simulator that fits an ARIMA to stock return data, makes a next day prediction, then buys/sells based on that prediction. It continues to accumulate shares until a sell signal is generated, at which point the program will liquidate the accumulated position and begin again.
Right now, I specify an interval of dates, then the loop will start by fitting an ARIMA to the first 14 days of return data, making a prediction for day 15, acting on the prediction, then it will begin again with the first 15 days, fitting a new ARIMA. It will continue this until it gets to the end of the range of dates specified, with each new iteration adding the previous day's sample.
So, basically n increases by 1 for every iteration of the loop. I don't want this. I want it to repeatedly fit to an interval of a fixed length. For example, say I'm testing a strategy over 500 trading days. For the first iteration I want the loop to take the 50 days prior to day 1 of the specified interval and fit an ARIMA, and then trade in the same manner as before, but for the next iteration of the loop, I don't want it to fit to 51 days, I want to fit the 50 days prior to the current date every time.
Here's the start of the simulation function where the for-loop is specified. I can't seem to figure out how to change the loop to accomplish my goal. Any help would be greatly appreciated!!
def run_simulation(returns, prices, amt, order, thresh, verbose=True, plot=True):
if type(order) == float:
thresh = None
curr_holding = False
sum_list = []
events_list = []
sharpe_list = []
init_amt = amt
#go through dates
for date, r in tqdm (returns.iloc[14:].items(), total=len(returns.iloc[14:])):
#get data til just before current date
curr_data = returns[:date]
# check if using ARIMA from order
if type(order) == tuple:
#fit model
model = ARIMA(curr_data, order=order).fit()
#get forecast
pred = model.forecast()
float_pred = float(pred)
Here's the full script for context:
import yfinance as yf
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
import numpy as np
import seaborn as sns
from tqdm import tqdm
import pandas as pd
from statsmodels.tools.sm_exceptions import ValueWarning, HessianInversionWarning, ConvergenceWarning
import warnings
#in practice do not suppress these warnings, they carry important information about the status of your model
warnings.filterwarnings('ignore', category=ValueWarning)
warnings.filterwarnings('ignore', category=HessianInversionWarning)
warnings.filterwarnings('ignore', category=ConvergenceWarning)
tickerSymbol = 'SPY'
data = yf.Ticker(tickerSymbol)
prices = data.history(start='2021-01-01', end='2022-01-03').Close
returns = prices.pct_change().dropna()
def std_dev(data):
# Get number of observations
n = len(data)
# Calculate mean
mean = sum(data) / n
# Calculate deviations from the mean
deviations = sum([(x - mean)**2 for x in data])
# Calculate Variance & Standard Deviation
variance = deviations / (n - 1)
s = variance**(1/2)
return s
# Sharpe Ratio From Scratch
def sharpe_ratio(data, risk_free_rate=0):
# Calculate Average Daily Return
mean_daily_return = sum(data) / len(data)
print(f"mean daily return = {mean_daily_return}")
# Calculate Standard Deviation
s = std_dev(data)
# Calculate Daily Sharpe Ratio
daily_sharpe_ratio = (mean_daily_return - risk_free_rate) / s
# Annualize Daily Sharpe Ratio
sharpe_ratio = 252**(1/2) * daily_sharpe_ratio
return sharpe_ratio
def run_simulation(returns, prices, amt, order, thresh, verbose=True, plot=True):
if type(order) == float:
thresh = None
curr_holding = False
sum_list = []
events_list = []
sharpe_list = []
init_amt = amt
#go through dates
for date, r in tqdm (returns.iloc[14:].items(), total=len(returns.iloc[14:])):
#get data til just before current date
curr_data = returns[:date]
# check if using ARIMA from order
if type(order) == tuple:
#fit model
model = ARIMA(curr_data, order=order).fit()
#get forecast
pred = model.forecast()
float_pred = float(pred)
#if you predict a high enough return and not holding, buy stock
# order for random strat and tuple for ARIMA
if float_pred > thresh \
or (order == 'last' and curr_data[-1] > 0):
buy_price = prices.loc[date]
events_list.append(('b', date))
int_buy_price = int(buy_price)
curr_holding = True
if verbose:
print('Bought at $%s'%buy_price)
print('Predicted Return: %s'%round(pred,4))
print(f"Current holdings = {sum(sum_list)}")
#if you predict below the threshold return, sell the stock
if (curr_holding) and \
((type(order) == float and np.random.random() < order)
or (type(order) == tuple and float_pred < thresh)
or (order == 'last' and curr_data[-1] > 0)):
sell_price = prices.loc[date]
total_return = len(sum_list) * sell_price
ret = (total_return-sum(sum_list))/sum(sum_list)
amt *= (1+ret)
events_list.append(('s', date, ret))
curr_holding = False
if verbose:
print('Sold at $%s'%sell_price)
print('Predicted Return: %s'%round(pred,4))
print('Actual Return: %s'%(round(ret, 4)))
if verbose:
sharpe = sharpe_ratio(sharpe_list, risk_free_rate=0.004)
print('Total Amount: $%s'%round(amt,2))
print(f"Sharpe Ratio: {sharpe}")
if plot:
y_lims = (int(prices.min()*.95), int(prices.max()*1.05))
shaded_y_lims = int(prices.min()*.5), int(prices.max()*1.5)
for idx, event in enumerate(events_list):
plt.axvline(event[1], color='k', linestyle='--', alpha=0.4)
if event[0] == 's':
color = 'green' if event[2] > 0 else 'red'
event[1], events_list[idx-1][1], color=color, alpha=0.1)
tot_return = round(100*(amt / init_amt - 1), 2)
sharpe = sharpe_ratio(sharpe_list, risk_free_rate=0)
tot_return = str(tot_return) + '%'
plt.title("%s Price Data\nThresh=%s\nTotal Amt: $%s\nTotal Return: %s"%(tickerSymbol, thresh, round(amt,2), tot_return), fontsize=20)
return amt
# A model that takes the dth difference and then fits an ARMA(p,q) model is called an ARIMA process
# of order (p,d,q). You can select p,d, and q with a wide range of methods,
# including AIC, BIC, and empirical autocorrelations (Petris, 2009).
for thresh in [0.001]:
run_simulation(returns, prices, 100000, (7,0,0), thresh, verbose=True)
curr_data = returns[:date]
curr_data_sliced = curr_data[-14:]
model=ARIMA(curr_data_sliced, ... )
Change the slice index to set the range of dates used for fitting,
e.g. [-50:] to train each iteration on only the 50 most recent data points.
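A self-contained sketch of the fixed-length window is shown below. `rolling_forecasts` and the mean-based `mean_forecast` are stand-ins invented for this example; in the simulator the forecaster would instead fit the ARIMA on `train` and return its one-step forecast. Positional slicing with `.iloc` is used because label slicing such as `returns[:date]` includes the end label, so the current day's return would otherwise be part of the training data.

```python
import numpy as np
import pandas as pd

def rolling_forecasts(returns, window, forecast_fn):
    """Forecast each observation from the `window` observations before it."""
    preds = {}
    for i in range(window, len(returns)):
        train = returns.iloc[i - window:i]  # fixed length, excludes position i
        preds[returns.index[i]] = forecast_fn(train)
    return pd.Series(preds)

# Placeholder forecaster (assumption): the mean of the training window
def mean_forecast(train):
    return float(train.mean())

dates = pd.date_range("2021-01-04", periods=10, freq="D")
returns = pd.Series(np.arange(10, dtype=float), index=dates)
preds = rolling_forecasts(returns, window=3, forecast_fn=mean_forecast)
```

Every iteration sees exactly `window` observations, so the sample size no longer grows as the loop advances.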
I have a dataset from which I have generated graphs. I am able to extract peaks from these graphs that are above a threshold using scipy. I am trying to create a dataframe which contains peak features such as peak value, peak width, peak height, the slope of the curve that contains the peak, and the number of points in that curve. I am struggling to find a way to extract the slope and the number of points in the curves that contain peaks.
c_dict["L-04"][3][0] data is present in the paste bin link.
This is the code that I have tried for extracting some of the peak features.
def extract_peak_features(c_dict,households):
for key,value in c_dict.items():
if not key.startswith("L-01") and not key.startswith("H"):
for k,v in value.items():
if k==3:
if len(v) > 0:
if key in households:
smoking = 1
else:
smoking = 0
peaks, _ = find_peaks(v[0],prominence=50)
half_widths = peak_widths(v[0], peaks, rel_height=0.5)[0]
widths = peak_widths(v[0], peaks, rel_height=1)[0]
if len(peaks) > 0:
smoke_list.extend([smoking] * len(peaks))
house_list.extend([key] * len(peaks))
data = {"ID":house_list,"peaks":peak_list,"width":width_list,"half_width":half_width_list,"smoke":smoke_list}
df_peak_stats = pd.DataFrame(data=data)
return df_peak_stats
df_peak_stats = extract_peak_features(c_dict,households)
A code for plotting c_dict["L-04"][3][0] data using scipy and matplotlib.
peaks, _ = find_peaks(c_dict["L-04"][3][0],prominence=50)
results_half = peak_widths(c_dict["L-04"][3][0], peaks, rel_height=0.5)
results_half[0] # widths
results_full = peak_widths(c_dict["L-04"][3][0], peaks, rel_height=1)
plt.plot(peaks, np.array(c_dict["L-04"][3][0])[peaks], "x")
#plt.hlines(*results_half[1:], color="C2")
plt.hlines(*results_full[1:], color="C3")
In summary, I want to know how to extract the slope and number of points in the 4 curves above that contain the peaks.
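One possible way to get at both quantities, sketched on made-up data, is to use the base indices returned by `scipy.signal.peak_prominences`. The array `x` and the feature names are assumptions for the example. The bases are the lowest points scipy finds on either side of a peak, which may or may not coincide with where the curve visually starts to rise.

```python
import numpy as np
from scipy.signal import find_peaks, peak_prominences

x = np.array([5, 4, 3, 2, 1, 0, 2, 4, 6, 8, 10, 7, 4, 1, 2, 3], dtype=float)
peaks, _ = find_peaks(x, prominence=5)
_, left_bases, right_bases = peak_prominences(x, peaks)

features = []
for p, lb, rb in zip(peaks, left_bases, right_bases):
    features.append({
        "peak_index": int(p),
        "rise_slope": (x[p] - x[lb]) / (p - lb),  # left base up to the peak
        "fall_slope": (x[rb] - x[p]) / (rb - p),  # peak down to the right base
        "n_points": int(rb - lb + 1),             # samples from base to base
    })
```

The list of dictionaries can be turned into a dataframe with `pd.DataFrame(features)` and joined with the width columns already being collected.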
Because the peaks in your data are localized, I created 4 subplots for each of the four peaks.
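import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks, peak_widths, peak_prominences
from kneed import KneeLocator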
test = np.array(test)
test_inds = np.arange(len(test))
peaks, _ = find_peaks(test,prominence=50)
prominences, left_bases, right_bases = peak_prominences(test,peaks)
offset = np.ones_like(prominences)
# Calculate widths at x[peaks] - offset * rel_height
widths, h_eval, left_ips, right_ips = peak_widths(
test, peaks,
prominence_data=(offset, left_bases, right_bases)
)
in which test is the array in your post. The code above basically locates the peaks in the array, in order to find the two associated points you want:
The point to the left of a peak where the upward curve starts
The point to the right of the peak and its value is close to the point on the left
Based on this post, you can use the kneed package to locate the first of these points.
fig,ax = plt.subplots(nrows=2,ncols=2,figsize=(18,10))
for ind,item in enumerate(zip(left_ips,right_ips)):
left_ip,right_ip = item
row_idx,col_idx = ind // 2,ind % 2
# This is where the peak locates
pc = np.array([int(left_ip)+1,test[int(left_ip)+1]])
# find the point where the curve starts to increase
# based on what your data look like, such a critical point can be found within the range
# test_inds[int(pc[0])-200: int(pc[0])], note that test_inds is an array of the inds of the points in your data
kn_l = KneeLocator(test_inds[int(pc[0])-200:int(pc[0])],test[int(pc[0])-200:int(pc[0])],curve='convex',direction='increasing')
kn_l = kn_l.knee
pl = np.array([kn_l,test[kn_l]])
# find the point to the right of the peak, the point is almost on the same level as the point on the left
# in this example, the threshold is set to 1
mask_zero = np.abs(test - pl[1]*np.ones(len(test))) < 1
mask_greater = test_inds > pc[0]
pr_idx = np.argmax(np.logical_and(mask_zero,mask_greater))
pr = np.array([pr_idx,test[pr_idx]])
get_angle = lambda v1, v2:\
np.rad2deg(np.arccos(np.clip(np.dot(v1, v2) / np.linalg.norm(v1) / np.linalg.norm(v2),-1,1)))
angle_l = get_angle(pr-pl,pc-pl)
angle_r = get_angle(pl-pr,pc-pr)
ax[row_idx][col_idx].annotate('%.2f deg' % angle_l,xy=pl+np.array([5,20]),xycoords='data')
ax[row_idx][col_idx].annotate('%.2f deg' % angle_r,xy=pr+np.array([-1,20]),xycoords='data')
rto_1 = (pc[1]-pl[1])/(pc[0]-pl[0])
rto_2 = (pc[1]-pr[1])/(pc[0]-pr[0])
ax[row_idx][col_idx].annotate('ratio1=%.3f' % rto_1,xy=pr+np.array([15,100]),xycoords='data')
ax[row_idx][col_idx].annotate('ratio2=%.3f' % rto_2,xy=pr+np.array([15,60]),xycoords='data')
pl_idx,pc_idx,pr_idx = pl[0].astype(int),pc[0].astype(int),pr[0].astype(int)
I have the histogram of my input data (in black) given in the following graph:
I'm trying to fit a Gamma distribution, not to the whole data set but only to the first curve of the histogram (the first mode). The green plot in the previous graph is what I get when I fit the Gamma distribution to all the samples using the following Python code, which makes use of scipy.stats.gamma:
img = IO.read(input_file)
data = img.flatten() + abs(np.min(img)) + 1
# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1
# data histogram
n, bins, patches = plt.hist(data, 1000, normed=True)
# slice histogram here
# estimation of the parameters of the gamma distribution
fit_alpha, fit_loc, fit_beta = gamma.fit(data, floc=0)
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, fit_loc, fit_beta)
print '(alpha, beta): (%f, %f)' % (fit_alpha, fit_beta)
# plot estimated model
plt.plot(x, y, linewidth=2, color='g')
How can I restrict the fitting only to the interesting subset of this data?
Update1 (slicing):
I sliced the input data by keeping only values below the max of the previous histogram, but the results were not really convincing:
This was achieved by inserting the following code below the # slice histogram here comment in the previous code:
max_data = bins[np.argmax(n)]
data = data[data < max_data]
Update2 (scipy.optimize.minimize):
The code below shows how scipy.optimize.minimize() is used to minimize an energy function to find (alpha, beta):
import matplotlib.pyplot as plt
import numpy as np
from geotiff.io import IO
from scipy.stats import gamma
from scipy.optimize import minimize
def truncated_gamma(x, max_data, alpha, beta):
gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
return np.where(x < max_data, gammapdf / norm, 0)
# read image
img = IO.read(input_file)
# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1
# data histogram
n, bins = np.histogram(data, 100, normed=True)
# using minimize on a slice data below max of histogram
max_data = bins[np.argmax(n)]
data = data[data < max_data]
data = np.random.choice(data, 1000)
energy = lambda p: -np.sum(np.log(truncated_gamma(data, max_data, *p)))
initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x
# plot data histogram and model
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, 0, fit_beta)
plt.hist(data, 30, normed=True)
plt.plot(x, y, linewidth=2, color='g')
The algorithm above converged for a subset of data, and the output in o was:
x: array([ 16.66912781, 6.88105559])
But as can be seen on the screenshot below, the gamma plot doesn't fit the histogram:
You can use a general optimization tool such as scipy.optimize.minimize to fit a truncated version of the desired function, resulting in a nice fit:
First, the modified function:
def truncated_gamma(x, alpha, beta):
    gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
    norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    return np.where(x < max_data, gammapdf / norm, 0)
This selects values from the gamma distribution where x < max_data, and zero elsewhere. The np.where part is not actually important here, because the data is exclusively to the left of max_data anyway. The key is normalization, because varying alpha and beta will change the area to the left of the truncation point in the original gamma.
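To illustrate the normalization point, here is a minimal sketch that integrates both the raw and the truncated density up to the cutoff for a few parameter pairs. The cutoff `max_data = 20.0` and the (alpha, beta) pairs are made-up values for illustration only, not taken from the question's data.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gamma

max_data = 20.0  # hypothetical truncation point, chosen only for illustration


def truncated_gamma(x, alpha, beta):
    gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
    norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    return np.where(x < max_data, gammapdf / norm, 0)


results = []
# Made-up parameter pairs; the area of the raw gamma below the cutoff differs for each.
for alpha, beta in [(2.0, 3.0), (5.0, 4.0), (9.0, 2.5)]:
    # Area of the untruncated gamma to the left of the cutoff.
    raw_area = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    # Area of the truncated density on [0, max_data], integrated numerically.
    trunc_area, _ = quad(lambda x: float(truncated_gamma(x, alpha, beta)), 0, max_data)
    results.append((alpha, beta, raw_area, trunc_area))
    print("alpha=%.1f beta=%.1f raw area=%.3f truncated area=%.3f"
          % (alpha, beta, raw_area, trunc_area))
```

The raw area changes with the parameters, while the truncated density should integrate to one each time. Without the division by `norm`, the optimizer could lower the energy simply by moving probability mass across the cutoff.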
The rest is just optimization technicalities.
It's common practice to work with logarithms, so I used what's sometimes called "energy", or the logarithm of the inverse of the probability density.
energy = lambda p: -np.sum(np.log(truncated_gamma(data, *p)))
initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x
My output is (alpha, beta): (11.595208, 824.712481). Like the original, it is a maximum likelihood estimate.
If you're not happy with the convergence rate, you may want to:
Select a sample from your rather big dataset:
data = np.random.choice(data, 10000)
Try different algorithms using the method keyword argument.
Some optimization routines output a representation of the inverse hessian, which is useful for uncertainty estimation. Enforcement of nonnegativity for the parameters may also be a good idea.
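As a rough sketch of those last two points, the fit can be run with bounds that keep both parameters positive, and the inverse Hessian read back from the result. Everything here is an assumption for illustration: the data is synthetic (drawn from a gamma with `true_alpha = 4`, `true_beta = 2` and cut at `max_data = 10`), and I swapped in `L-BFGS-B` instead of `SLSQP` because it accepts bounds and returns `hess_inv`.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

# Synthetic stand-in for the real data (assumed values, not from the question).
rng = np.random.RandomState(0)
true_alpha, true_beta = 4.0, 2.0
max_data = 10.0
sample = gamma.rvs(true_alpha, scale=true_beta, size=20000, random_state=rng)
data = sample[sample < max_data]  # keep only the part left of the cutoff


def energy(p):
    """Negative log-likelihood of the truncated gamma for parameters p."""
    alpha, beta = p
    logpdf = gamma.logpdf(data, alpha, loc=0, scale=beta)
    lognorm = gamma.logcdf(max_data, alpha, loc=0, scale=beta)
    return -np.sum(logpdf - lognorm)


# Lower bounds slightly above zero enforce positivity of alpha and beta.
o = minimize(energy, [2.0, 1.0], method='L-BFGS-B',
             bounds=[(1e-6, None), (1e-6, None)])
fit_alpha, fit_beta = o.x

# L-BFGS-B returns the inverse Hessian as a linear operator; todense() gives a matrix.
cov = o.hess_inv.todense()
stderr = np.sqrt(np.diag(cov))
print("(alpha, beta): (%f, %f)" % (fit_alpha, fit_beta))
print("rough standard errors: (%f, %f)" % (stderr[0], stderr[1]))
```

Since the energy is a negative log-likelihood, its inverse Hessian at the minimum approximates the covariance of the estimates. The limited-memory approximation from `L-BFGS-B` is crude, so treat these standard errors as an order-of-magnitude indication only.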
A log-scaled plot without truncation shows the entire distribution:
Here's another possible approach, using a dataset created manually in Excel that more or less matches the plot given.
Raw Data
Imported data into a Pandas dataframe.
Mask the indices after the max response index.
Create a mirror image of the remaining data.
Append the mirror image while leaving a buffer of empty space.
Fit the desired distribution to the modified data. Below I do a normal fit by the method of moments and adjust the amplitude and width.
Working Script
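# Import libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import data to dataframe.
df = pd.read_csv('sample.csv', header=0, index_col=0)
# Mask indices after index at max Y (idxmax returns the index label).
mask = df.index.values <= df.Y.idxmax()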
df = df.loc[mask, :]
scaled_y = 100*df.Y.values
# Create new df with mirror image of Y appended.
sep = 6
app_zeroes = np.append(scaled_y, np.zeros(sep, dtype=float))
mir_y = np.flipud(scaled_y)
new_y = np.append(app_zeroes, mir_y)
# Using Scipy-cookbook to fit a normal by method of moments.
idxs = np.arange(new_y.size) # idxs=[0, 1, 2,...,len(data)]
mid_idxs = idxs.mean() # len(data)/2
# idxs-mid_idxs is [-53.5, -52.5, ..., 52.5, len(data)/2]
scaling_param = np.sqrt(np.abs(np.sum((idxs-mid_idxs)**2*new_y)/np.sum(new_y)))
# adjust amplitude
fmax = new_y.max()*1.2 # adjusted function max to 120% max y.
# adjust width
scaling_param = scaling_param*.7 # adjusted by 70%.
# Fit normal.
fit = lambda t: fmax*np.exp(-(t-mid_idxs)**2/(2*scaling_param**2))
# Plot results.
plt.plot(new_y, '.')
plt.plot(fit(idxs), '--')
See the scipy-cookbook fitting data page for more on fitting a normal using method of moments.
I’m trying to plot data, and in order to check my code I’m comparing the resulting plots with ones that have already been generated with Matlab. However, I am encountering several issues:
Generally, the parsing of the RINEX files works, and the overall shape of the plotted data looks similar to what the Matlab scripts produce. However, there are small deviations in the data that should become apparent when zooming in, i.e. when plotting a shorter time span such as a specific 2-hour period instead of the full 24 hours. In Matlab, this small discrepancy can be seen, and a polynomial fit is applied. In the Python plot (the first plot shown below), however, the curve over this 2-hour period appears “smooth” and does not deviate at all, unlike in the Matlab plot (the second plot shows the data as the blue line against the polyfit as the red line; the blue line shows a slight discrepancy at x=9.4). The Matlab script is assumed to be correct, as this deviation is caused by seismic activity that temporarily disrupts the ionosphere. Please refer to the plots below:
The third plot is in Matlab, where this is simply the polyfit minus the live data.
Therefore, it is not clear how this data is being plotted on the axes by the Python script, since the data appears too smooth, nor whether my code is wrong (see below) and somehow “smooths” out the data:
#Calculating by looping through
for sv in range(32):
    sat = self.obs_data_chunks_dataframe[sv, :]
    #print "sat.index_{0}: {1}".format(sv+1, sat.index)
    phi1 = sat['L1'] * LAMBDA_1 #Change units of L1 to meters
    phi2 = sat['L2'] * LAMBDA_2 #Change units of L2 to meters
    pr1 = sat['P1']
    pr2 = sat['P2']
    #CALCULATION: teqc Calculation
    iono_teqc = COEFF * (pr2 - pr1) / 1000000 #divide to make values smaller (tbc)
    print "iono_teqc_{0}: {1}".format(sv+1, iono_teqc)
    #Plotting of the data
    plt.plot(sat.index, iono_teqc, label='teqc')
    plt.xlabel('Time (UTC)')
    plt.ylabel('Ionosphere Delay (meters)')
    plt.title("Ionosphere Delay on {0} for Satellite {1}.".format(self.date, sv+1))
    ax = plt.gca()
    if sys.platform.startswith('win'):
        plt.savefig(winpath + '\\Figure_SV{0}'.format(sv+1))
    elif sys.platform.startswith('darwin'):
        plt.savefig(macpath + 'Figure_SV{0}'.format(sv+1))
Following on from point 1, the polynomial fitting code below does not run the way I’d like, so I’m overlooking something here. I assume this has to do with the data used upon the x,y-axes but can’t pinpoint exactly what. Would anyone know where I am going wrong here?
#Zoomed in plots
if sv == 19:
    #Plotting of the data
    plt.plot(sat.index, iono_teqc, label='teqc') #sat.index to plot for time in UTC
    plt.xlim(8, 10)
    plt.xlabel('Time (UTC)')
    plt.ylabel('Ionosphere Delay (meters)')
    plt.title("Ionosphere Delay on {0} for Satellite {1}.".format(self.date, sv+1))
    ax = plt.gca()
    #Polynomial fitting
    coefficients = np.polyfit(sat.index, iono_teqc, 2)
    if sys.platform.startswith('win'):
        #os.path.join(winpath, 'Figure_SV{0}'.format(sv+1))
        plt.savefig(winpath + '\\Zoom_SV{0}'.format(sv+1))
    elif sys.platform.startswith('darwin'):
        plt.savefig(macpath + 'Zoom_SV{0}'.format(sv+1))
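For reference, this is a minimal sketch of what I am trying to reproduce from Matlab: fit the polynomial only to the samples inside the 8 to 10 h window, evaluate it, and subtract the data from the fit. The arrays `t` and `delay` are synthetic stand-ins for `sat.index` and `iono_teqc` (a smooth curve plus an artificial bump near 9.4 h), so none of the numbers come from the real RINEX file.

```python
import numpy as np

# Synthetic stand-ins (assumptions): time in hours UTC and delay in meters.
t = np.arange(0.0, 24.0, 1.0 / 120.0)            # one sample every 30 s
delay = 5.0 + 0.02 * (t - 12.0) ** 2             # smooth daily trend
delay = delay + 0.05 * np.exp(-((t - 9.4) / 0.05) ** 2)  # small bump near 9.4 h

# Restrict the fit to the zoomed window and skip NaN/inf samples,
# since np.polyfit cannot handle missing values.
window = (t >= 8.0) & (t <= 10.0) & np.isfinite(delay)
t_win = t[window]
y_win = delay[window]

# Second-order polynomial fit over the window only, evaluated at the same times.
coefficients = np.polyfit(t_win, y_win, 2)
fitted = np.polyval(coefficients, t_win)

# Polyfit minus the data, as in the third Matlab plot.
residual = fitted - y_win
peak_time = t_win[np.argmax(np.abs(residual))]
print("largest deviation from the fit at t = %.2f h" % peak_time)
```

Plotting `t_win` against `y_win` and `fitted` would give the blue and red curves, and `t_win` against `residual` the third plot. Fitting over the full 24 hours instead of the window would give a different polynomial, which may be part of what goes wrong in my code above.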
My RINEX file comprises 32 satellites. However when trying to generate the plots for all 32, I receive:
IndexError: index 31 is out of bounds for axis 0 with size 31
Changing the code below to 31 solves this partly, only excluding the 32nd satellite. I’d like to also plot for satellite 32. The functions for the parsing, and formatting of the data are given below:
def read_obs(self, RINEXfile, n_sat, sat_map):
    obs = np.empty((TOTAL_SATS, len(self.obs_types)), dtype=np.float64) * np.NaN
    lli = np.zeros((TOTAL_SATS, len(self.obs_types)), dtype=np.uint8)
    signal_strength = np.zeros((TOTAL_SATS, len(self.obs_types)), dtype=np.uint8)
    for i in range(n_sat):
        # Join together observations for a single satellite if split across lines.
        obs_line = ''.join(padline(RINEXfile.readline()[:-1], 16) for _ in range((len(self.obs_types) + 4) / 5))
        #obs_line = ''.join(padline(RINEXfile.readline()[:-1], 16) for _ in range(2))
        #while obs_line
        for j in range(len(self.obs_types)):
            obs_record = obs_line[16*j:16*(j+1)]
            obs[sat_map[i], j] = floatornan(obs_record[0:14])
            lli[sat_map[i], j] = digitorzero(obs_record[14:15])
            signal_strength[sat_map[i], j] = digitorzero(obs_record[15:16])
    return obs, lli, signal_strength
def read_data_chunk(self, RINEXfile, CHUNK_SIZE = 10000):
obss = np.empty((CHUNK_SIZE, TOTAL_SATS, len(self.obs_types)), dtype=np.float64) * np.NaN
llis = np.zeros((CHUNK_SIZE, TOTAL_SATS, len(self.obs_types)), dtype=np.uint8)
signal_strengths = np.zeros((CHUNK_SIZE, TOTAL_SATS, len(self.obs_types)), dtype=np.uint8)
epochs = np.zeros(CHUNK_SIZE, dtype='datetime64[us]')
flags = np.zeros(CHUNK_SIZE, dtype=np.uint8)
i = 0
while True:
hdr = self.read_epoch_header(RINEXfile)
if hdr is None:
epoch_time, flags[i], sats = hdr
#epochs[i] = np.datetime64(epoch_time)
epochs[i] = epoch_time
sat_map = np.ones(len(sats)) * -1
for n, sat in enumerate(sats):
if sat[0] == 'G':
sat_map[n] = int(sat[1:]) - 1
obss[i], llis[i], signal_strengths[i] = self.read_obs(RINEXfile, len(sats), sat_map)
i += 1
if i >= CHUNK_SIZE:
return obss[:i], llis[:i], signal_strengths[:i], epochs[:i], flags[:i]
def read_data(self, RINEXfile):
obs_data_chunks = []
while True:
obss, _, _, epochs, _ = self.read_data_chunk(RINEXfile)
epochs = epochs.astype(np.int64)
epochs = np.divide(epochs, float(3600.000))
if obss.shape[0] == 0:
np.rollaxis(obss, 1, 0),
items=['G%02d' % d for d in range(1, 33)],
).dropna(axis=0, how='all').dropna(axis=2, how='all'))
self.obs_data_chunks_dataframe = obs_data_chunks[0]
Any suggestions?
Cheers, pymat.
I managed to solve point 1: it was a conversion issue in my calculation that I had overlooked. The other two points are still open.