Plot an infinite line between two pandas series points - python

I want to plot an infinite, non-ending line through two points that are in the form of a pandas Series. I am able to successfully plot a standard line between the points; however, I don't want the line to "end", it should continue. Expanding on this, I would also like to extract the values of this new infinite line to a new dataframe, so that I can see what line value a given x value has.
data = yf.download("AAPL", start="2021-01-01", interval = "1d").drop(columns=['Adj Close'])
data = data[30:].rename(columns={"Open": "open", "High": "high", "Low": "low", "Close": "close", "Volume": "volume"})
local_max = argrelextrema(data['high'].values, np.greater)[0]
local_min = argrelextrema(data['low'].values, np.less)[0]
highs = data.iloc[local_max,:]
lows = data.iloc[local_min,:]
highesttwo = highs["high"].nlargest(2)
lowesttwo = lows["low"].nsmallest(2)
fig = plt.figure(figsize=[10,7])
data['high'].plot(marker='o', markevery=local_max)
data['low'].plot(marker='o', markevery=local_min)
highesttwo.plot()
lowesttwo.plot()
plt.show()
Currently my plot looks like this:
However, I want it to look like this instead, and I also want to be able to get the values of the line for a corresponding x value.

This can be done in a few steps, as shown in the following example, where the lines are computed with element-wise (i.e. vectorized) operations using the slope-intercept form of the line equation.
The stock data has a frequency based on the opening dates of the stock exchange. This frequency is not automatically recognized by pandas, so the .plot method produces a plot with a continuous date range on the x-axis that includes the days with no data. This can be avoided by setting the argument use_index=False, so that the x-axis uses integers starting from zero instead.
The challenge is then to create nicely formatted tick labels. The following example attempts to imitate the pandas tick format by using list comprehensions to select the tick locations and format the labels. These will need to be adjusted if the date range is significantly lengthened or shortened.
import numpy as np # v 1.19.2
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
from scipy.signal import argrelextrema # v 1.6.1
import yfinance as yf # v 0.1.54
# Import data
data = (yf.download('AAPL', start='2021-01-04', end='2021-03-15', interval='1d')
          .drop(columns=['Adj Close']))
data = data.rename(columns={'Open': 'open', 'High': 'high', 'Low': 'low',
                            'Close': 'close', 'Volume': 'volume'})
# Extract points and get appropriate x values for the points by using
# reset_index for highs/lows
local_max = argrelextrema(data['high'].values, np.greater)[0]
local_min = argrelextrema(data['low'].values, np.less)[0]
highs = data.reset_index().iloc[local_max, :]
lows = data.reset_index().iloc[local_min, :]
htwo = highs['high'].nlargest(2).sort_index()
ltwo = lows['low'].nsmallest(2).sort_index()
# Compute slope and y-intercept for each line
slope_high, intercept_high = np.polyfit(htwo.index, htwo, 1)
slope_low, intercept_low = np.polyfit(ltwo.index, ltwo, 1)
# Create dataframe for each line by using reindexed htwo and ltwo so that the
# index extends to the end of the dataset and serves as the x variable then
# compute y values
# High
line_high = htwo.reindex(range(htwo.index[0], len(data))).reset_index()
line_high.columns = ['x', 'y']
line_high['y'] = slope_high*line_high['x'] + intercept_high
# Low
line_low = ltwo.reindex(range(ltwo.index[0], len(data))).reset_index()
line_low.columns = ['x', 'y']
line_low['y'] = slope_low*line_low['x'] + intercept_low
# Plot data using pandas plotting function and add lines with matplotlib function
fig = plt.figure(figsize=[10,6])
ax = data['high'].plot(marker='o', markevery=local_max, use_index=False)
data['low'].plot(marker='o', markevery=local_min, use_index=False)
ax.plot(line_high['x'], line_high['y'])
ax.plot(line_low['x'], line_low['y'])
ax.set_xlim(0, len(data)-1)
# Set major and minor tick locations
tks_maj = [idx for idx, timestamp in enumerate(data.index)
           if (timestamp.month != data.index[idx-1].month) | (idx == 0)]
tks_min = range(len(data))
ax.set_xticks(tks_maj)
ax.set_xticks(tks_min, minor=True)
# Format major and minor tick labels
labels_maj = [ts.strftime('\n%b\n%Y') if (data.index[tks_maj[idx]].year
              != data.index[tks_maj[idx-1]].year) | (idx == 0)
              else ts.strftime('\n%b') for idx, ts in enumerate(data.index[tks_maj])]
labels_min = [ts.strftime('%d') if (idx+3) % 5 == 0 else ''
              for idx, ts in enumerate(data.index[tks_min])]
ax.set_xticklabels(labels_maj)
ax.set_xticklabels(labels_min, minor=True)
plt.show()
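Since the line dataframes map each integer x to a y value, the line value for a given date can be retrieved by first converting the date to its integer position. A small sketch using the variables defined above (the date is only an example):
# Value of the upper trend line on a given trading day, e.g. 2021-03-01
x_pos = data.index.get_loc('2021-03-01')      # integer position of the date
y_high = slope_high * x_pos + intercept_high  # line value at that x
# or look it up in the dataframe built above
y_high_df = line_high.loc[line_high['x'] == x_pos, 'y']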
You can find more examples of tick formatting here and here in Solution 1.
Date string format codes

Related

Having some problem to understand the x_bin in regplot of Seaborn

I used seaborn.regplot to plot data, but I don't quite understand how the error bars in regplot are calculated. I have compared the results with the mean and standard deviation derived from a manual calculation. Here is my testing script.
import numpy as np
import pandas as pd
import seaborn as sns

def get_data_XYE(p):
    x_list = []
    lower_list = []
    upper_list = []
    for line in p.lines:
        x_list.append(line.get_xdata()[0])
        lower_list.append(line.get_ydata()[0])
        upper_list.append(line.get_ydata()[1])
    y = 0.5 * (np.asarray(lower_list) + np.asarray(upper_list))
    y_error = np.asarray(upper_list) - y
    x = np.asarray(x_list)
    return x, y, y_error
x = [37.3448,36.6026,42.7795,34.7072,75.4027,226.2615,192.7984,140.8045,242.9952,458.451,640.6542,726.1024,231.7347,107.5605,200.2254,190.0006,314.1349,146.8131,152.4497,175.9096,284.9926,116.9681,118.2953,312.3787,815.8389,458.0146,409.5797,595.5373,188.9955,15.7716,36.1839,244.8689,57.4579,94.8717,112.2237,87.0687,72.79,22.3457,24.1728,29.505,80.8765,252.7454,280.6002,252.9573,348.246,112.705,98.7545,317.0541,300.9573,402.8411,406.6884,56.1286,30.1385,32.9909,497.556,19.3606,20.8409,95.2324,108.6074,15.7753,54.5511,45.5623,64.564,101.1934,81.8459,88.286,58.2642,56.1225,51.2943,38.0649,63.5882,63.6847,120.495,102.4097,49.3255,111.3309,171.6028,58.9526,28.7698,144.6884,180.0661,116.6028,146.2594,199.8702,128.9378,423.2363,119.8537,124.6508,518.8625,306.3023,79.5213,121.0309,116.9346,170.8863,930.361,48.9983,55.039,47.1092,72.0548,75.4045,103.521,83.4134,142.3253,146.6215,121.4467,101.4252,68.4812,291.4275,143.9475,142.647,78.9826,47.094,204.2196,89.0208,82.792,27.1346,142.4764,83.7874,67.3216,112.9531,138.2549,133.3446,86.2659,45.3464,56.1604,43.5882,54.3623,86.296,115.7272,96.5498,111.8081,36.1756,40.2947,34.2532,89.1452,53.9062,36.458,113.9297,176.9962,77.3125,77.8891,64.807,64.1515,127.7242,119.6876,976.2324,322.8454,434.2883,168.6923,250.0284,234.7329,131.0793,152.335,118.8838,243.1772,24.1776,168.6327,170.7541,167.8444,75.9315,110.1045,113.4417,60.5464,66.8956,79.7606,71.6659,72.5251,77.513,207.8019,21.8592,35.2787,169.7698,146.5012,412.9934,248.0708,318.5489,104.1278,184.7592,108.0581,175.2646,169.7698,340.3732,570.3396,23.9853,69.0405,66.7391,67.9435,294.6085,68.0537,77.6344,433.2713,104.3178,229.4615,187.8587,78.1399,121.4737,122.5451,384.5935,38.5232,117.6835,50.3308,318.2513,103.6695,20.7181,321.9601,510.3248,13.4754,16.1188,44.8082,37.7291,733.4587,446.6241,21.1822,287.9603,327.2367,274.1109,195.4713,158.2114,64.4537,26.9857,172.8503]
y = [37,40,30,29,24,23,27,12,21,20,29,28,27,32,23,29,28,22,28,23,24,29,32,18,22,12,12,14,29,31,34,31,22,40,25,36,27,27,29,35,33,25,25,27,27,19,35,26,18,24,25,37,52,47,34,39,40,48,41,44,35,36,53,46,38,44,23,26,26,28,27,21,25,21,20,27,35,24,46,34,22,30,30,30,31,26,25,28,21,31,24,27,33,21,31,33,29,33,32,21,25,22,39,31,34,26,23,18,20,18,34,25,20,12,23,25,21,21,25,31,17,27,28,29,25,24,25,21,24,27,23,22,23,22,22,26,22,19,26,35,33,35,29,26,26,30,22,32,33,33,28,32,26,29,36,37,37,28,24,30,25,20,29,24,33,35,30,32,31,33,40,35,37,24,34,29,27,24,36,26,26,26,27,27,20,17,28,34,18,20,20,18,19,23,20,22,25,32,44,41,39,41,40,44,36,42,31,32,26,29,23,29,29,28,31,22,29,24,28,28,25]
xbreaks = [13.4754, 27.1346, 43.5882, 58.9526, 72.79, 89.1452, 110.1045, 131.0793, 158.2114, 180.0661, 207.8019, 234.7329, 252.9573, 300.9573, 327.2367, 348.246, 412.9934, 434.2883, 458.451, 518.8625, 595.5373, 640.6542, 733.4587, 815.8389, 930.361, 976.2324]
df = pd.DataFrame([x, y]).T
df.columns = ['x', 'y']
# Check the bin average and std using agg
bins = pd.cut(df.x, xbreaks, right=False)
t = df[['x', 'y']].groupby(bins).agg({"x": "mean", "y": ["mean", "std"]})
t.reset_index(inplace=True)
t.columns = ['range_cut', 'x_avg_cut', 'y_avg_cut', 'y_std_cut']
t.index.name = 'id'
# Get the bin averages and error bars from regplot
seed = 0  # fixed seed for regplot's bootstrapping
g = sns.regplot(x='x', y='y', data=df, fit_reg=False, x_bins=xbreaks, seed=seed)
xye = pd.DataFrame(get_data_XYE(g)).T
xye.columns = ['x_regplot', 'y_regplot', 'e_regplot']
xye.index.name = 'id'
t2 = xye.merge(t, on='id', how='left')
t2
You can see that the y and e values from the two approaches are different. I understand that the default x_ci or x_estimator may affect the result of regplot, but I still cannot reproduce these values in Excel by removing some of the lowest and/or highest values in each bin.
In seaborn.regplot, the values passed as x_bins are the centers of each bin, and the original x values are assigned to the nearest bin center, whereas in pandas.cut the breaks define the bin edges.
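A small illustration of the difference, with toy numbers rather than the data above:
import numpy as np
import pandas as pd
x = np.array([1.0, 2.0, 4.0, 7.0, 9.0])
centers = [2.0, 8.0]
# regplot-style binning: each x is mapped to the nearest bin *center*
print([min(centers, key=lambda c: abs(v - c)) for v in x])
# [2.0, 2.0, 2.0, 8.0, 8.0]
# pandas.cut-style binning: the same numbers act as bin *edges*
print(pd.cut(x, [2.0, 8.0], right=False).tolist())
# values outside [2.0, 8.0) become NaN
This is why the two aggregations run over different groups of points, and why the means and errors do not match.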

Newbie stuck on Python Quandl ValueError

I am trying to calculate the Sharpe ratio with a set of stock symbols. The code works with the first 5 stock symbols; however, it stops working when I use more than 5 symbols.
I searched the documentation for dimension errors that could explain the ValueError message, but I do not see any possibilities. I also searched Quandl and Google for the error I was getting but could not find a specific result.
If someone could please let me know what I am doing wrong, that would be great. I am very new to coding.
# import needed modules
import quandl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# get adjusted closing prices of 5 selected companies with Quandl
quandl.ApiConfig.api_key = 'oskr4yzppjZwxgJ7zNra'
selected = ['TGT', 'AAPL', 'MSFT', 'FCN', 'TSLA', 'SPY', 'XLV', 'BRK.B', 'WMT', 'JPM']
data = quandl.get_table('WIKI/PRICES', ticker=selected,
                        qopts={'columns': ['date', 'ticker', 'adj_close']},
                        date={'gte': '2009-1-1', 'lte': '2019-12-31'}, paginate=True)
# reorganize pulled data by setting date as index, with
# columns of tickers and their corresponding adjusted prices
clean = data.set_index('date')
table = clean.pivot(columns='ticker')
# calculate daily and annual returns of the stocks
returns_daily = table.pct_change()
returns_annual = returns_daily.mean() * 250
# get daily and covariance of returns of the stock
cov_daily = returns_daily.cov()
cov_annual = cov_daily * 250
# empty lists to store returns, volatility and weights of imaginary portfolios
port_returns = []
port_volatility = []
sharpe_ratio = []
stock_weights = []
# set the number of combinations for imaginary portfolios
num_assets = len(selected)
num_portfolios = 50000
# set random seed for reproduction's sake
np.random.seed(101)
# populate the empty lists with each portfolios returns,risk and weights
for single_portfolio in range(num_portfolios):
    weights = np.random.random(num_assets)
    weights /= np.sum(weights)
    returns = np.dot(weights, returns_annual)
    volatility = np.sqrt(np.dot(weights.T, np.dot(cov_annual, weights)))
    sharpe = returns / volatility
    sharpe_ratio.append(sharpe)
    port_returns.append(returns)
    port_volatility.append(volatility)
    stock_weights.append(weights)
# a dictionary for Returns and Risk values of each portfolio
portfolio = {'Returns': port_returns,
             'Volatility': port_volatility,
             'Sharpe Ratio': sharpe_ratio}
# extend original dictionary to accommodate each ticker and weight in the portfolio
for counter, symbol in enumerate(selected):
    portfolio[symbol + ' weight'] = [weight[counter] for weight in stock_weights]
# make a nice dataframe of the extended dictionary
df = pd.DataFrame(portfolio)
# get better labels for desired arrangement of columns
column_order = ['Returns', 'Volatility', 'Sharpe Ratio'] + [stock+' weight' for stock in selected]
# reorder dataframe columns
df = df[column_order]
# find min Volatility & max sharpe values in the dataframe (df)
min_volatility = df['Volatility'].min()
max_sharpe = df['Sharpe Ratio'].max()
# use the min, max values to locate and create the two special portfolios
sharpe_portfolio = df.loc[df['Sharpe Ratio'] == max_sharpe]
min_variance_port = df.loc[df['Volatility'] == min_volatility]
# plot the efficient frontier with a scatter plot
plt.style.use('seaborn-dark')
df.plot.scatter(x='Volatility', y='Returns', c='Sharpe Ratio',
                cmap='RdYlGn', edgecolors='black', figsize=(10, 8), grid=True)
plt.scatter(x=sharpe_portfolio['Volatility'], y=sharpe_portfolio['Returns'], c='red', marker='D', s=200)
plt.scatter(x=min_variance_port['Volatility'], y=min_variance_port['Returns'], c='blue', marker='D', s=200)
plt.xlabel('Volatility (Std. Deviation)')
plt.ylabel('Expected Returns')
plt.title('Efficient Frontier')
plt.show()
# print the details of the 2 special portfolios
print(min_variance_port.T)
print(sharpe_portfolio.T)
The error I am getting is this:
ValueError Traceback (most recent call last)
<ipython-input-8-3e66668bf017> in <module>
42 weights = np.random.random(num_assets)
43 weights /= np.sum(weights)
---> 44 returns = np.dot(weights, returns_annual)
45 volatility = np.sqrt(np.dot(weights.T, np.dot(cov_annual, weights)))
46 sharpe = returns / volatility
ValueError: shapes (10,) and (7,) not aligned: 10 (dim 0) != 7 (dim 0)
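Judging from the (10,) vs (7,) shapes in the traceback, a likely cause (an assumption, since it depends on what Quandl returns) is that only 7 of the 10 requested tickers exist in the WIKI/PRICES table, so returns_annual has fewer entries than weights. A quick check, sketched with the variables from the script above:
# How many tickers actually came back from Quandl? (diagnostic sketch)
available = table.columns.get_level_values('ticker').unique()
print(len(available), 'of', len(selected), 'tickers returned data')
# Deriving num_assets from the returned data keeps the dot product aligned
num_assets = len(available)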

Waterfall chart python matplotlib

I am having a problem with a waterfall chart. I took this chart from the matplotlib site and added my own data frame with 2 simple columns containing some integer numbers. My waterfall was produced, but without numbers, just empty bars. I am a bit lost and I would appreciate any suggestions.
What I am trying to build is a custom waterfall that takes one dataframe with column names, values, and some values for filters like countries. I haven't found anything like that anywhere, so I am trying to build my own.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
dataset = pd.read_csv('waterfall_test_data.csv')
#Use python 2.7+ syntax to format currency
def money(x, pos):
    'The two args are the value and tick position'
    return "${:,.0f}".format(x)
formatter = FuncFormatter(money)
#Data to plot. Do not include a total, it will be calculated
index = dataset['columns']
data = dataset['amount']
#Store data and create a blank series to use for the waterfall
trans = pd.DataFrame(data=data,index=index)
blank = trans.amount.cumsum().shift(1).fillna(0)
#Get the net total number for the final element in the waterfall
total = trans.sum().amount
trans.loc["net"]= total
blank.loc["net"] = total
#The steps graphically show the levels as well as used for label placement
step = blank.reset_index(drop=True).repeat(3).shift(-1)
step[1::3] = np.nan
#When plotting the last element, we want to show the full bar,
#Set the blank to 0
blank.loc["net"] = 0
#Plot and label
my_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, figsize=(15, 5), title="2014 Sales Waterfall")
my_plot.plot(step.index, step.values,'k')
my_plot.set_xlabel("Transaction Types")
#Format the axis for dollars
my_plot.yaxis.set_major_formatter(formatter)
#Get the y-axis position for the labels
y_height = trans.amount.cumsum().shift(1).fillna(0)
#Get an offset so labels don't sit right on top of the bar
max = trans.max()
neg_offset = max / 25
pos_offset = max / 50
plot_offset = int(max / 15)
#Start label loop
loop = 0
for index, row in trans.iterrows():
    # For the last item in the list, we don't want to double count
    if row['amount'] == total:
        y = y_height[loop]
    else:
        y = y_height[loop] + row['amount']
    # Determine if we want a neg or pos offset
    if row['amount'] > 0:
        y += pos_offset
    else:
        y -= neg_offset
    my_plot.annotate("{:,.0f}".format(row['amount']), (loop, y), ha="center")
    loop += 1
#Scale up the y axis so there is room for the labels
my_plot.set_ylim(0,blank.max()+int(plot_offset))
#Rotate the labels
my_plot.set_xticklabels(trans.index,rotation=0)
my_plot.get_figure().savefig("waterfall.png",dpi=200,bbox_inches='tight')
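A debugging sketch (the column names are taken from the code above; whether this reveals the problem depends on the contents of waterfall_test_data.csv, which isn't shown): printing the intermediate objects shows whether the amounts were parsed as numbers and whether the cumulative baseline looks sensible.
# Diagnostic sketch: inspect the parsed data and the waterfall baseline
print(dataset.dtypes)   # 'amount' should be a numeric dtype, not object
print(trans.head())     # the values to be plotted as bars
print(blank.head())     # the cumulative bottom offsets of each bar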

Adding a 45 degree line to a time series stock data plot

I guess this is supposed to be simple, but I can't seem to make it work.
I have some stock data
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range(start="06/01/2018", end="08/01/2018"),
                  data=np.random.rand(62)*100)
I am doing some analysis on it, this results of my drawing some lines on the graph.
And I want to plot a 45 line somewhere on the graph as a reference for lines I drew on the graph.
What I have tried is
x = df.tail(len(df)/20).index
x = x.reset_index()
x_first_val = df.loc[x.loc[0].date].adj_close
In order to get some point and then use slope = 1 and calculate y values.. but this sounds all wrong.
Any ideas?
Here is a possibility:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame(index=pd.date_range(start="06/01/2018", end="08/01/2018"),
                  data=np.random.rand(62)*100,
                  columns=['data'])
# Get values for the time:
index_range = df.index[('2018-06-18' < df.index) & (df.index < '2018-07-21')]
# get the timestamps in nanoseconds (since epoch)
timestamps_ns = index_range.astype(np.int64)
# convert it to a relative number of days (for example, could be seconds)
time_day = (timestamps_ns - timestamps_ns[0]) / 1e9 / 60 / 60 / 24
# Define y-data for a line:
slope = 3 # unit: "something" per day
something = time_day * slope
trendline = pd.Series(something, index=index_range)
# Graph:
df.plot(label='data', alpha=0.8)
trendline.plot(label='some trend')
plt.legend(); plt.ylabel('something');
which gives the following plot (the random data with the trend line overlaid):
Edit: the first version of this answer used dayofyear instead of the timestamps:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame(index=pd.date_range(start="06/01/2018", end="08/01/2018"),
                  data=np.random.rand(62)*100,
                  columns=['data'])
# Define data for a line:
slope = 3 # unit: "something" per day
index_range = df.index[('2018-06-18' < df.index) & (df.index < '2018-07-21')]
dayofyear = index_range.dayofyear # it will not work around the new year...
dayofyear = dayofyear - dayofyear[0]
something = dayofyear * slope
trendline = pd.Series(something, index=index_range)
# Graph:
df.plot(label='data', alpha=0.8)
trendline.plot(label='some trend')
plt.legend(); plt.ylabel('something');
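One caveat: a slope of 1 in data units only renders at 45 degrees on screen if the x and y axes happen to use the same scale, which is unlikely with dates against prices. To draw a line that is visually at about 45 degrees, the slope can be derived from the axis limits instead (a sketch, assuming a roughly square axes box):
import matplotlib.pyplot as plt
ax = plt.gca()
x0, x1 = ax.get_xlim()
y0, y1 = ax.get_ylim()
# data-units slope that renders at ~45 degrees in a square axes box
visual_slope = (y1 - y0) / (x1 - x0)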

Including multiple seasonal terms in Python statsmodels.tsa ARIMA

I am trying to model a time series in Python using Python 2.7.11 and the excellent statsmodels.tsa package. My data consists of hourly measurements of traffic intensity over several weeks. Thus, the data has multiple seasonal components: days form a 24-hour period; weeks form a 168-hour period.
At this point, the modeling options in statsmodels.tsa are not set up to handle multiple seasonality, as they only allow for the specification of one seasonal factor. However, I came across the work of Rob Hyndman on multiple seasonality in R. He advocates modeling the seasonal components of a time series using Fourier series, including a Fourier series in the model for the frequencies corresponding to each of the seasonal periods.
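For concreteness, this is roughly what such Fourier seasonal terms look like for hourly data with 24- and 168-hour periods (my own illustration of the idea, not Hyndman's code):
import numpy as np
import pandas as pd
n = 1008                  # number of hourly observations (6 weeks)
t = np.arange(n)
terms = {}
for period in (24, 168):  # daily and weekly seasonality
    for k in (1, 2):      # first two harmonics of each period
        terms['sin_%d_%d' % (period, k)] = np.sin(2 * np.pi * k * t / period)
        terms['cos_%d_%d' % (period, k)] = np.cos(2 * np.pi * k * t / period)
exog = pd.DataFrame(terms)  # candidate exogenous regressors for the ARIMA fit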
I've used Welch's method to obtain the power spectral density of my observed time series, extracted the peaks in the spectrum which correspond to the frequencies at which I expect my seasonal effects, and used the frequency and amplitude to generate a sine wave pattern corresponding to the seasonal trends I expect in my data. As an aside, I think this allows me to bypass Hyndman's step of selecting the value of k based on the AIC, because I am using the signal inherent in the observed data.
To ensure that the sine waves match the occurrence of the seasonal pattern in the data, I match the peak of both sine wave patterns to the peaks in the observed data by visually selecting a peak within one of the 24-hour periods, and matching the hour of its occurrence to the highest value of the variable representing the sine wave. Prior to this, I have checked that the daily peaks occur at the same hour consistently.
So far, so good it seems - plots of the sine waves constructed with the obtained frequencies and amplitudes roughly correspond to the observed data. I then fit an ARIMA(2,0,0) model, including both of the decomposition-based variables as exogenous variables. At this point, I want to test the predictive utility of the model. However, this is where things get complicated.
When I use ARIMA from the statsmodels package, the estimates I get from fitting the model form a pattern which replicates the sine waves, but with a range of values matching my observations. There is still a lot of variance in the observations which is not explained by the seasonal trends, leading me to believe that somewhere in the model fitting procedure something is not going the way it is supposed to.
Unfortunately, I am not sufficiently well-versed in the art of time series modeling to know if my unexpected results are due to the nature of exogenous variables I am including, statsmodels functionality that I should be using, but am omitting, or wrongful assumptions about the concept of seasonal trends.
Some concrete questions I have are:
is it possible to include multiple seasonal trends (i.e. Fourier- or decomposition-based) in an ARIMA model using statsmodels in python?
could reconstruction of the seasonal trend using sine waves cause difficulties when the sine waves are included as exogenous variables in the model as specified above and in the code below?
why does the model specified in the code below not yield predictions which match the observed data more closely?
Any help is much appreciated!
Best wishes, and thanks in advance,
Evert
p.s.: Sorry if my code sample and data file are overly long - as I am not sure what causes the unexpected results I thought I'd post the whole thing. Also, apologies for not following PEP8 at times - I'm still learning :)
Code sample:
import os
import re
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.signal import welch
import operator
# Function which plots rolling mean of data set in order to estimate stationarity
# 'timeseries' = Data to be used for ARIMA modeling
#
def plotmean(timeseries, show=0, path=''):
    rolmean = pd.rolling_mean(timeseries, window=12)
    rolstd = pd.rolling_std(timeseries, window=12)
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue', label='Observed scores')
    mean = plt.plot(rolmean, color='red', label='Rolling mean')
    std = plt.plot(rolstd, color='black', label='Rolling SD')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
#
# Function to decompose a function over time f(t) into a spectrum of signal amplitude and frequency
# 'dta' = The dataset used
# 'show' = Whether or not to show plot
# 'path' = Where to store plot, if desirable
#
# Output:
# frequency range and spectral density range
#
def runwelch(dta, show, path):
    nps = (len(dta) / 2) + 8
    nov = nps / 2
    fft = nps
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    f, Pxx_den = welch(dta, fs=fs_temp, nperseg=nps, noverlap=nov, nfft=fft, scaling="spectrum")
    plt.plot(f, Pxx_den)
    plt.ylim([0.5e-7, 10])
    plt.xlabel('frequency [Hz]')
    plt.ylabel('PSD [V**2/Hz]')
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return f, Pxx_den
#
# Function which gets amplitude and frequency of n most important periodical cycles, and provides plot
# to visually inspect if they correspond to expected seasonal components.
# 'freq' = output of Welch decomposition
# 'density' = output of Welch decomposition
# 'n' = desired number of peaks to extract
# 'show' = whether to show plots of corresponding sine functions
def getsines(n_obs, freq, density, n, show):
    ftemp = freq
    dtemp = density
    fstore = []
    dstore = []
    astore = []
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    samplespace = n_obs * 3600
    for a in range(0, n, 1):
        max_index, max_value = max(enumerate(dtemp), key=operator.itemgetter(1))
        dstore.append(max_value)
        fstore.append(ftemp[max_index])
        astore.append(np.sqrt(max_value))
        dtemp[max_index] = 0
    if show == 1:
        for b in range(0, len(fstore), 1):
            sound_sine = sine(fstore[b], samplespace, fs_temp, astore[b])
            plt.plot(sound_sine)
            plt.show()
            plt.clf()
    return fstore, astore

def sine(freq, time_interval, rate, amp):
    w = 2. * np.pi * freq
    t = np.linspace(0, time_interval, time_interval * rate)
    y = amp * np.sin(w * t)
    return y
#
# Function which adapts the calculated sine waves for the returned sines for k = 1 through k = kmax
# 'dta' = Data set
def buildFterms(dta, fstore, astore):
    n = len(fstore)
    n_obs = len(dta)
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    samplespace = n_obs * 3600 + (24 * 3600)
    # Add one excess day for later fitting of sine waves to peaks
    store = []
    for i in range(0, n, 1):
        tmp = sine(fstore[i], samplespace, 0.0002778, astore[i])
        store.append(tmp)
    k_168_store = store[0]
    k_24_store = store[1]
    k_24 = np.transpose(k_24_store)
    k_168 = np.transpose(k_168_store)
    k_24 = pd.Series(k_24)
    k_168 = pd.Series(k_168)
    dta_ind, dta_val = max(enumerate(dta.iloc[120:143]), key=operator.itemgetter(1))
    # Visually inspect mean plot, select interval which has a clear and representative peak, use to determine index.
    k_24_ind, k_24_val = max(enumerate(k_24.iloc[0:23]), key=operator.itemgetter(1))
    # Peak in sound level at index 1 is matched by peak in sine wave at index 7.
    # Thus, sound level[0] corresponds to sine waves[6].
    # print dta_ind, dta_val, k_24_ind, k_24_val
    k_24_sel = k_24[6:1014]
    k_168_sel = k_168[6:1014]
    exog = pd.concat([k_24_sel, k_168_sel], axis=1)
    return exog
#
# Function which takes data, makes a plot of the ACF and PACF, and saves the plot, if needed
# 'x' = Time series data, time indexed, over which to plot the ACF and PACF.
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
# Use output plot to visually interpret necessary parameters p, d, q, and seasonal component for SARIMAX procedure
#
def plotpacf(x, show=0, path=''):
    dflength = len(x)
    nlags = int(dflength * .80)
    fig = plt.figure(figsize=(12, 8))
    ax1 = fig.add_subplot(211)
    fig = sm.graphics.tsa.plot_acf(x.squeeze(), lags=nlags, ax=ax1)
    ax2 = fig.add_subplot(212)
    fig = sm.graphics.tsa.plot_pacf(x, lags=nlags, ax=ax2)
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
#
# Function to calculate the Dickey-Fuller test of stationarity
# 'dta' = Time series data, time indexed, over which to test for stationarity using the Dickey-Fuller test.
#
def dftest(dta):
    print 'Results of Dickey-Fuller Test:'
    dftest = sm.tsa.stattools.adfuller(dta, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    if dfoutput[0] < dfoutput[4]:
        dfoutput['Stationary'] = 'True'
    else:
        dfoutput['Stationary'] = 'False'
    print dfoutput
#
# Function to difference the time series, in order to determine optimal value of d for ACF and PACF
# 'dta' = Data, time series indexed, to be differenced
# 'd' = Order of differencing to be applied
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#
def diffit(dta, d, show, path=''):
    templist = []
    for i in range(0, (len(dta) - d), 1):
        tempval = dta[i] - dta[i + d]
        templist.append(tempval)
    y = templist[d:len(templist)]
    y = pd.Series(y)
    plotpacf(y, show, path)
    return y
#
# Function to fit the ARIMA model based on parameters obtained from the ACF / PACF plot
# 'dta' = Time series data, time indexed, over which to fit a SARIMAX model.
# 'exog' = Exogenous variables used in ARIMA model
# 'p' = Number of AutoRegressive lags, initially based on the cutoff point of the PACF
# 'd' = Order of differencing based on visual examination of ACF and PACF plots
# 'q' = Number of Moving Average lags, initially based on the cutoff point of the ACF
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#
def runARIMA(dta, exogvar, p, d, q, show=0, path=''):
    mod = sm.tsa.ARIMA(dta, (p, d, q), exogvar)
    results = mod.fit()
    resids = results.resid.values
    summarised = results.summary()
    print summarised
    plotpacf(resids, show, path)
    return results
#
# Function to use fitted ARIMA for prediction of observed data, compare predicted to observed
# 'dta' = Data used in ARIMA prediction
# 'exog' = Exogenous variables fitted in the model
# 'arima' = Result from correctly fitted ARIMA model, likely on the residuals of a decomposed time series
# 'datrng' = Range of dates used for original time series definition, used for specifying predictions
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#
def ARIMAcompare(dta, exogvar, arima, datrng, show=0, path=''):
    dflength = len(datrng) - 1
    observation = dta
    prediction = arima.predict(start=3, end=dflength, exog=exogvar, dynamic=True)
    df = pd.concat([prediction, observation], axis=1)
    df.columns = ['predicted', 'observed']
    plt.plot(prediction)
    plt.plot(observation)
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return df
#
# Function use fitted ARIMA model for predictions
# 'pred_hours' = number of hours we want to predict scores for
# 'firsttime' = last timestamp in observations
# 'df' = data frame containing data on which the ARIMA model was previously fitted
# 'results' = output of the modeling procedure
# 'freq' = Frequency of seasonal cycle that was used in decomposition
# 'decomp' = Output of the time series decomposition step
# 'mark' = Amount of hours included in the graph prior to prediction. Set at as close to 2 weeks as possible.
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#
# Output: A dataframe with observed and predicted values. Note that predictions > 5 time units are considered unreliable
# by modeling standards.
#
def pred(pred_hours, k, df, arima, show=0, path=''):
    n_obs = len(df.index)
    lastdt = df.index[n_obs - 1]
    lastdt = lastdt.to_datetime()
    datrng = pd.date_range(lastdt, periods=(pred_hours + 1), freq='H')
    future = pd.DataFrame(index=datrng, columns=df.columns)
    df = pd.concat([df, future])
    lendf = len(df.index)
    df['predicted'] = arima.predict(start=n_obs, end=lendf, exog=k, dynamic=True)
    print df
    marked = 2 * pred_hours
    df[['predicted', 'observed']].ix[-marked:].plot(figsize=(12, 8))
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return df[['predicted', 'observed']].ix[-marked:]
dirnow = os.getcwd()
fpath = dirnow + '/sounds_full2.csv'
fhand = open(fpath)
dta = pd.read_csv(fhand, sep=',')
dta_sel = dta.iloc[1248:2256, 2]
#
#
#
# Extract start and end date of measurements from sound data, adding one hour because
# the last hour of the last day is not counted
#
sound_start = dta.iloc[1248, 0]
# The above .iloc value needs to be changed depending on the length of the sound data set being read in.
#
# Establish start date
sound_start = re.sub('-', '/', sound_start)
sound_start = re.sub('_', ' ', sound_start)
sound_start = sound_start + ':00'
sound_start = pd.to_datetime(sound_start, format='%d/%m/%Y %H:%M:%S')
#
# Establish end date
indexer = len(dta.index) - 1
sound_end = dta.iloc[indexer, 0]
sound_end = re.sub('-', '/', sound_end)
sound_end = re.sub('_', ' ', sound_end)
sound_end = sound_end + ':00'
sound_end = pd.to_datetime(sound_end, format='%d/%m/%Y %H:%M:%S')
sound_diff = sound_end - sound_start
#
# Derive number of periods and create data set
num_observed = (sound_diff.days * 24) + ((sound_diff.seconds + 3600) / 3600)
usedates3 = pd.date_range(sound_start, periods=num_observed, freq='H')
usedates3 = pd.Series(usedates3)
usedates3.index = dta_sel.index
timedfreq = pd.concat([usedates3, dta_sel], axis=1)
timedfreq.index = timedfreq.iloc[:, 0]
freqset = pd.Series(timedfreq.iloc[:, 1])
filepath = dirnow + '/Sound_RollingMean.png'
plotmean(freqset, 0, filepath)
# Plotted mean shows recurring (seasonal) trends at periods of 24 hours and 168 hours.
# This means a seasonal model is needed that accounts for both of these influences
# To do so, Fourier series representing the 24- and 168 hour seasonal trends can be added to the ARIMA-model
#
#
#
#
# Check for stationarity of data
#
dftest(freqset)
# Time series can be considered stationary
#
#
#
# Establish frequencies and amplitudes with which to fit ARIMA model
#
# Decompose signal into frequency and amplitude
#
filepath = dirnow + "/Welch.png"
f, Pxx_den = runwelch(freqset, 0, filepath)
#
# Obtain sine wave parameters, optionally view test plots to check periodicity
freqs, amplitudes = getsines(len(freqset), f, Pxx_den, 2, 0)
#
# Use parameters to build Fourier series for observed data with varying values for k
exog_sel = buildFterms(freqset, freqs, amplitudes)
exog_sel.index = freqset.index
#
# fit ARIMA model, plot ACF and PACF for fitted model, check for effects orders of differencing on residuals
#
filepath = dirnow + '/Sound_resid_ACFPACF.png'
Sound_ARIMA = runARIMA(freqset, exog_sel, 1, 0, 0, show=0, path=filepath)
sound_residuals = Sound_ARIMA.resid
#
# Plot various acf / pacf plots of differencing given model residuals
filepath = dirnow + '/Sound_resid_ACFPACF_d1.png'
tempdta_d1 = diffit(sound_residuals, 1, 0, filepath)
filepath = dirnow + '/Sound_resid_ACFPACF_d2.png'
tempdta_d2 = diffit(sound_residuals, 2, 0, filepath)
# Of the two differenced models, one order of differencing seems to yield the best results
# Visual inspection of plots and model output suggests model with p = 2, d = 0 or p = 1, d = 1 to be optimal.
#
#
#
# Find optimal form of model
filepath = dirnow + '/Sound_resid_ACFPACF_200.png'
Sound_ARIMA_200 = runARIMA(freqset, exog_sel, 2, 0, 0, show=0, path=filepath)
filepath = dirnow + '/Sound_resid_ACFPACF_110.png'
Sound_ARIMA_110 = runARIMA(freqset, exog_sel, 1, 1, 0, show=0, path=filepath)
# Based on model output and ACF / PACF plot comparison for 'Sound_resid_ACFPACF_110.png' and \
# 'Sound_resid_ACFPACF_200.png', the model parameters for p = 2, d = 0, q = 0 are closer to optimal.
#
# Use selected model to predict observed values
filepath = dirnow + '/Sound_PredictObserved.png'
sound_comparison = ARIMAcompare(freqset, exog_sel, Sound_ARIMA_200, usedates3, 0, filepath)
#
# Predict values and store for Sound dataset
filepath = dirnow + '/Sound_PredictFuture.png'
sound_storepred = pred(168, exog_sel.iloc[0:170, :], sound_comparison, Sound_ARIMA_200, 0, filepath)
Data file
