I am using ExponentialSmoothing from statsmodels to run the Holt-Winters method on a time series.
I get the forecasted values, but I cannot extract the calculated (fitted) values to compare them with the observed values.
from pandas import Series
from scipy import stats
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.api import ExponentialSmoothing
modelHW = ExponentialSmoothing(np.asarray(passtrain_df['n_passengers']), seasonal_periods=12, trend='add', seasonal='mul').fit()
y_hat_avg['Holt_Winter'] = modelHW.forecast(prediction_size)
So here, prediction_size = the number of forecasted datapoints (4 in my case).
passtrain_df is a dataframe with the observations (140 datapoints) on which the Holt-Winters model is fitted.
I can easily display 4 forecasted values.
How do I extract 140 calculated values?
Tried to use:
print(ExponentialSmoothing.predict(np.asarray(passtrain_df), start=0, end=139))
But I probably have a syntax error somewhere
Thank you!
Edit:
Replaced synthetic dataset with sample data from OP
Fixed function that builds new forecast period
Fixed x-axis date format as per OP's request
Answer:
If you're looking for calculated values within your estimation period, you should use modelHW.fittedvalues and not modelHW.forecast(). The latter will give you just what it says: forecasts. And it's pretty awesome. Let me show you how to do both things:
Plot 1 - Model within estimation period
Plot 2 - Forecasts
Code:
#imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from statsmodels.tsa.api import ExponentialSmoothing
import matplotlib.dates as mdates
#%%
#
# Load data
pass_df = pd.read_csv('https://raw.githubusercontent.com/dacatay/time-series-analysis/master/data/passengers.csv', sep=';')
pass_df = pass_df.set_index('month')
type(pass_df.index)  # check the index type (a plain Index of strings at this point)
df = pass_df.copy()
# Model
modelHW = ExponentialSmoothing(np.asarray(df['n_passengers']), seasonal_periods=12, trend='add', seasonal='mul',).fit()
modelHW.summary()
# Model, fitted values
model_values = modelHW.fittedvalues
model_period = df.index
df_model = pd.concat([df['n_passengers'], pd.Series(model_values, index = model_period)], axis = 1)
df_model.columns = ['n_passengers', 'HWmodel']
df_model = df_model.set_index(pd.DatetimeIndex(df_model.index))
# Model, plot
fig, ax = plt.subplots()
myFmt = mdates.DateFormatter('%Y-%m')
df_model.plot(ax = ax, x_compat=True)
ax.xaxis.set_major_formatter(myFmt)
# Forecasts
prediction_size = 10
forecast_values = modelHW.forecast(prediction_size)
# Forecasts, build new period
forecast_start = df.index[-1]
forecast_start = pd.to_datetime(forecast_start, format='%Y-%m-%d')
forecast_period = pd.period_range(forecast_start, periods=prediction_size+1, freq='M')
forecast_period = forecast_period[1:]
# Forecasts, create dataframe
df_forecast = pd.Series(forecast_values, index = forecast_period.values).to_frame()
df_forecast.columns = ['HWforecast']
# merge input and forecast dataframes
df_all = pd.merge(df,df_forecast, how='outer', left_index=True, right_index=True)
#df_all = df_all.set_index(pd.DatetimeIndex(df_all.index.values))
ix = df_all.index
ixp = pd.PeriodIndex(ix, freq = 'M')
df_all = df_all.set_index(ixp)
# Forecast, plot
fig, ax = plt.subplots()
myFmt = mdates.DateFormatter('%Y-%m')
df_all.plot(ax = ax, x_compat=True)
ax.xaxis.set_major_formatter(myFmt)
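If you also want to compare the calculated values with the observed ones numerically rather than just visually, here is a minimal sketch using the df_model frame built above (the choice of RMSE as the error metric is mine, not from the original post):
# In-sample error between observed and fitted values
rmse = np.sqrt(np.mean((df_model['n_passengers'] - df_model['HWmodel'])**2))
print('In-sample RMSE: {:.2f}'.format(rmse))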
Previous attempts:
# imports
import pandas as pd
import numpy as np
from statsmodels.tsa.api import ExponentialSmoothing
# Data that matches your setup, but with a random
# seed to make it reproducible
np.random.seed(42)
# Time
date = pd.to_datetime("1st of Jan, 2019")
dates = date+pd.to_timedelta(np.arange(140), 'D')
# Data
n_passengers = np.random.normal(loc=0.0, scale=5.0, size=140).cumsum()
n_passengers = n_passengers.astype(int) + 100
df = pd.DataFrame({'n_passengers':n_passengers},index=dates)
1. How to plot observed vs. estimated values within the estimation period:
The following snippet will extract all fitted values and plot them against your observed values.
Snippet 2:
# Model
modelHW = ExponentialSmoothing(np.asarray(df['n_passengers']), seasonal_periods=12, trend='add', seasonal='mul',).fit()
modelHW.summary()
# Model, fitted values
model_values = modelHW.fittedvalues
model_period = df.index
df_model = pd.concat([df['n_passengers'], pd.Series(model_values, index = model_period)], axis = 1)
df_model.columns = ['n_passengers', 'HWmodel']
df_model.plot()
Plot 1:
2. How to produce and plot model forecasts of a certain length:
The following snippet will produce 10 forecasts from your model, and plot them as an extended period alongside your observed values.
Snippet 3:
# Forecast
prediction_size = 10
forecast_values = modelHW.forecast(prediction_size)
forecast_period = df.index[-1] + pd.to_timedelta(np.arange(prediction_size+1), 'D')
forecast_period = forecast_period[1:]
df_forecast = pd.concat([df['n_passengers'], pd.Series(forecast_values, index = forecast_period)], axis = 1)
df_forecast.columns = ['n_passengers', 'HWforecast']
df_forecast.plot()
Plot 2:
And here's the whole thing for an easy copy&paste:
# imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from statsmodels.tsa.api import ExponentialSmoothing
# Data that matches your setup, but with a random
# seed to make it reproducible
np.random.seed(42)
# Time
date = pd.to_datetime("1st of Jan, 2019")
dates = date+pd.to_timedelta(np.arange(140), 'D')
# Data
n_passengers = np.random.normal(loc=0.0, scale=5.0, size=140).cumsum()
n_passengers = n_passengers.astype(int) + 100
df = pd.DataFrame({'n_passengers':n_passengers},index=dates)
# Model
modelHW = ExponentialSmoothing(np.asarray(df['n_passengers']), seasonal_periods=12, trend='add', seasonal='mul',).fit()
modelHW.summary()
# Model, fitted values
model_values = modelHW.fittedvalues
model_period = df.index
df_model = pd.concat([df['n_passengers'], pd.Series(model_values, index = model_period)], axis = 1)
df_model.columns = ['n_passengers', 'HWmodel']
df_model.plot()
# Forecast
prediction_size = 10
forecast_values = modelHW.forecast(prediction_size)
forecast_period = df.index[-1] + pd.to_timedelta(np.arange(prediction_size+1), 'D')
forecast_period = forecast_period[1:]
df_forecast = pd.concat([df['n_passengers'], pd.Series(forecast_values, index = forecast_period)], axis = 1)
df_forecast.columns = ['n_passengers', 'HWforecast']
df_forecast.plot()
@vestland - here is the code and error:
y_train = passtrain_df.copy(deep=True)
model_HW = ExponentialSmoothing(np.asarray(y_train['n_passengers']), seasonal_periods=12, trend='add', seasonal='mul',).fit()
model_values = model_HW.fittedvalues
model_period = y_train.index
hw_model = pd.concat([y_train['n_passengers'], pd.Series(model_values, index = model_period)], axis = 1)
hw_model.columns = ['Observed Passengers', 'Holt-Winters']
plt.figure(figsize=(18,12))
hw_model.plot()
forecast_values = model_HW.forecast(prediction_size)
forecast_period = y_train.index[-1] + pd.to_timedelta(np.arange(prediction_size+1),'D')
forecast_period = forecast_period[1:]
hw_forecast = pd.concat([y_train['n_passengers'], pd.Series(forecast_values, index = forecast_period)], axis = 1)
hw_forecast.columns = ['Observed Passengers', 'HW-Forecast']
hw_forecast.plot()
Error:
NullFrequencyError Traceback (most recent call last)
<ipython-input-25-5f37a0dd0cfa> in <module>()
17
18 forecast_values = model_HW.forecast(prediction_size)
---> 19 forecast_period = y_train.index[-1] + pd.to_timedelta(np.arange(prediction_size+1),'D')
20 forecast_period = forecast_period[1:]
21
/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/datetimelike.py in __radd__(self, other)
879 def __radd__(self, other):
880 # alias for __add__
--> 881 return self.__add__(other)
882 cls.__radd__ = __radd__
883
/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/datetimelike.py in __add__(self, other)
842 # This check must come after the check for np.timedelta64
843 # as is_integer returns True for these
--> 844 result = self.shift(other)
845
846 # array-like others
/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/datetimelike.py in shift(self, n, freq)
1049
1050 if self.freq is None:
-> 1051 raise NullFrequencyError("Cannot shift with no freq")
1052
1053 start = self[0] + n * self.freq
NullFrequencyError: Cannot shift with no freq
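The NullFrequencyError above is pandas refusing to shift an index that has no freq attribute set, which happens here because y_train.index was not built with a fixed frequency. One way around it (the same idea as the pd.period_range construction in the fixed code further up) is to build the forecast period explicitly from the last observed date; a minimal sketch:
# build the forecast period without shifting the freq-less index
forecast_start = pd.to_datetime(y_train.index[-1])
forecast_period = pd.period_range(forecast_start, periods=prediction_size + 1, freq='M')
forecast_period = forecast_period[1:]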
Related
Given some measurements, I am trying to create a beta distribution. Given a max, min, and mean, and also an alpha and beta, how do I call beta.ppf or beta.pdf to generate a proper data set?
Working Sample
https://www.kaggle.com/iancoetzer/betaworking
Broken Sample
https://www.kaggle.com/iancoetzer/betaproblem
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import beta
#
# Set the shape parameters
#
a = 2.8754
b = 3.0300
minv = 82.292
maxv = 129.871
mean = 105.46
#
# Generate the value between
#
x = np.linspace(beta.ppf(minv, a, b),beta.ppf(maxv, a, b), 100)  # note: ppf expects probabilities in [0, 1], not raw values - this is what breaks the sample
#
# Plot the beta distribution
#
plt.figure(figsize=(7,7))
plt.xlim(0.7, 1)
plt.plot(x, beta.pdf(x, a, b), 'r-')
plt.title('Beta Distribution', fontsize='15')
plt.xlabel('Values of Random Variable X (0, 1)', fontsize='15')
plt.ylabel('Probability', fontsize='15')
plt.show()
We managed to code a simple solution to compute and plot the beta distribution as follows: see the red beta curve.
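For reference, a minimal sketch of the working approach (the key difference from the broken sample: beta.ppf and beta.pdf take probabilities in [0, 1], and the [minv, maxv] range is handled through the loc and scale arguments, as in the beta sampling further down):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

a, b, minv, maxv = 2.8754, 3.0300, 82.292, 129.871
x = np.linspace(minv, maxv, 100)
# loc/scale map the standard [0, 1] beta support onto [minv, maxv]
plt.plot(x, beta.pdf(x, a, b, loc=minv, scale=maxv - minv), 'r-')
plt.title('Beta Distribution')
plt.show()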
Now we are trying to plot a Weibull distribution ...
#import libraries
import pandas as pd, numpy as np, gc, time, os, uuid, math, datetime
from joblib import Parallel, delayed
from numpy.random import default_rng
from scipy.stats import beta
from scipy import special
from scipy.stats import exponweib
import matplotlib.pyplot as plt
#sample parameters
low, high, mean, a, b, trials = 82.292, 129.871, 105.46, 2.8754, 3.0300, 10000
scale = (high-low)/6
#normal
normal_arr = np.random.normal(loc=mean, scale=scale, size=trials)
#triangular
triangular_arr = np.random.triangular(left=low, mode=mean, right=high, size=trials)
#log normal
mu = math.log(math.pow(mean,2) / math.sqrt(math.pow(scale,2) + math.pow(mean,2)))
sigma = math.sqrt(math.log(math.pow(scale,2)/(math.pow(mean,2)) + 1))
lognorm_arr = np.random.lognormal(mean=mu, sigma=sigma, size=trials)
#beta
beta_x = np.linspace(beta.ppf(0.0, a, b),beta.ppf(1, a, b), trials)
#by = beta.pdf(bx, a, b)
beta_arr = beta.ppf(beta_x, a, b, loc=low, scale=high - low)
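#weibull (note: wei_arr is used in the binning calls below but was never defined
#in the original post; this is a minimal sketch with an assumed shape parameter,
#shifted and scaled onto a range comparable to the other distributions)
wei_shape = 2.0  # assumed Weibull shape parameter
wei_arr = low + np.random.weibull(wei_shape, trials) * scale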
#define binning(arr) method:
def binning(arr):
    df = pd.DataFrame(arr)
    df["Trial"] = range(1, len(df) + 1)
    df[0] = df[0].astype(float)
    df.rename(columns = {0: "Result"}, inplace=True)
    minval = df["Result"].min()
    maxval = df["Result"].max()
    binCount = 100
    bins = np.linspace(minval, maxval, binCount + 1)
    labels = np.arange(1, binCount + 1)
    df["bins"] = pd.cut(df["Result"], bins = bins, labels = labels, include_lowest = True)
    dfBin = df.groupby(["bins"])["Result"].mean()
    dfCount = df.groupby(["bins"])["Result"].count()
    dfBin.replace(np.nan, 0.0, inplace=True)
    dfCount.replace(np.nan, 0, inplace=True)
    dfCount = pd.DataFrame(dfCount)
    dfBin = pd.DataFrame(dfBin)
    dfBin["bin"] = range(1, len(dfBin) + 1)
    dfBin["Result"] = dfBin["Result"].astype(float)
    df = pd.merge(dfBin, dfCount, left_index=True, right_index=True)
    # Rename the resulting columns
    df.rename(columns = {'Result_x':'Mean'}, inplace = True)
    df.rename(columns = {'Result_y':'Trials'}, inplace = True)
    return df
dfNormal = binning(normal_arr)
dfLog = binning(lognorm_arr)
dfTriangular = binning(triangular_arr)
dfBeta = binning(beta_arr)
dfWeibull = binning(wei_arr)
dfNormal.drop(dfNormal[dfNormal["Mean"] == 0].index, inplace=True)
dfLog.drop(dfLog[dfLog["Mean"] == 0].index, inplace=True)
dfTriangular.drop(dfTriangular[dfTriangular["Mean"] == 0].index, inplace=True)
dfBeta.drop(dfBeta[dfBeta["Mean"] == 0].index, inplace=True)
dfWeibull.drop(dfWeibull[dfWeibull["Mean"] == 0].index, inplace=True)
plt.plot(dfNormal["Mean"], dfNormal["Trials"], label="Normal")
plt.plot(dfLog["Mean"], dfLog["Trials"], label="Lognormal")
plt.plot(dfTriangular["Mean"], dfTriangular["Trials"], label="Triangular")
plt.plot(dfBeta["Mean"], dfBeta["Trials"], label="Beta")
plt.plot(dfWeibull["Mean"], dfWeibull["Trials"], label="Weibull")
plt.legend(loc='upper right')
plt.xlabel("R amount")
plt.ylabel("# Trials")
#plt.xlim(low, high)
plt.show()
The issue that I have is with a rather simple approach to forecasting time series in Python using a SARIMAX model and 2 variables:
endogenous: the one of interest.
exogenous: the one assumed to have some influence on the endogenous variable.
The example uses the daily values of BTC and ETH, where ETH is endogenous and BTC is exogenous.
import datetime
import numpy as np
import matplotlib.pyplot as plt
import math
import pandas as pd
import pmdarima as pm
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from datetime import date
from math import sqrt
from dateutil.relativedelta import relativedelta
from statsmodels.tsa.statespace.sarimax import SARIMAX
import itertools
from random import random
import yfinance as yf
plt.style.use('ggplot')
Fetching the data is quite simple using the Yahoo Finance API via yfinance (imported as yf):
today = datetime.datetime.today()
ticker = input('Enter your ticker: ')
df1 = yf.download(ticker, period = 'max', interval = '1d')
df1.reset_index(inplace = True)
df1
This needs to be done manually - insert the name of the coin by hand (this gives the user more freedom in choosing which coins are combined).
Enter your ticker: BTC-USD
[*********************100%***********************] 1 of 1 completed
Date Open High Low Close Adj Close Volume
0 2014-09-17 465.864014 468.174011 452.421997 457.334015 457.334015 21056800
1 2014-09-18 456.859985 456.859985 413.104004 424.440002 424.440002 34483200
2 2014-09-19 424.102997 427.834991 384.532013 394.795990 394.795990 37919700
3 2014-09-20 394.673004 423.295990 389.882996 408.903992 408.903992 36863600
4 2014-09-21 408.084991 412.425995 393.181000 398.821014 398.821014 26580100
... ... ... ... ... ... ... ...
2677 2022-01-15 43101.898438 43724.671875 42669.035156 43177.398438 43177.398438 18371348298
2678 2022-01-16 43172.039062 43436.808594 42691.023438 43113.878906 43113.878906 17902097845
2679 2022-01-17 43118.121094 43179.390625 41680.320312 42250.550781 42250.550781 21690904261
2680 2022-01-18 42250.074219 42534.402344 41392.214844 42375.632812 42375.632812 22417209227
2681 2022-01-19 42365.046875 42462.070312 41248.902344 42142.539062 42142.539062 24763551744
2682 rows × 7 columns
So df1 is our exogenous data. Then the endogenous data are fetched in the same manner.
today = datetime.datetime.today()
ticker = input('Enter your ticker: ')
df2 = yf.download(ticker, period = 'max', interval = '1d')
df2.reset_index(inplace = True)
df2
Enter your ticker: ETH-USD
[*********************100%***********************] 1 of 1 completed
Date Open High Low Close Adj Close Volume
0 2017-11-09 308.644989 329.451996 307.056000 320.884003 320.884003 893249984
1 2017-11-10 320.670990 324.717987 294.541992 299.252991 299.252991 885985984
2 2017-11-11 298.585999 319.453003 298.191986 314.681000 314.681000 842300992
3 2017-11-12 314.690002 319.153015 298.513000 307.907990 307.907990 1613479936
4 2017-11-13 307.024994 328.415009 307.024994 316.716003 316.716003 1041889984
... ... ... ... ... ... ... ...
1528 2022-01-15 3309.844238 3364.537842 3278.670898 3330.530762 3330.530762 9619999078
1529 2022-01-16 3330.387207 3376.401123 3291.563721 3350.921875 3350.921875 9505934874
1530 2022-01-17 3350.947266 3355.819336 3157.224121 3212.304932 3212.304932 12344309617
1531 2022-01-18 3212.287598 3236.016113 3096.123535 3164.025146 3164.025146 13024154091
1532 2022-01-19 3163.054932 3170.838135 3055.951416 3123.905762 3123.905762 14121734144
1533 rows × 7 columns
Next comes a merging step, where the two datasets are aligned on date.
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
data = df2.merge(df1, on = 'Date', how = 'left')
which looks like this:
Date Open High Low Close_x Adj Close Volume Close_y
0 2017-11-09 308.644989 329.451996 307.056000 320.884003 320.884003 893249984 7143.580078
1 2017-11-10 320.670990 324.717987 294.541992 299.252991 299.252991 885985984 6618.140137
2 2017-11-11 298.585999 319.453003 298.191986 314.681000 314.681000 842300992 6357.600098
3 2017-11-12 314.690002 319.153015 298.513000 307.907990 307.907990 1613479936 5950.069824
4 2017-11-13 307.024994 328.415009 307.024994 316.716003 316.716003 1041889984 6559.490234
... ... ... ... ... ... ... ... ...
1528 2022-01-15 3309.844238 3364.537842 3278.670898 3330.530762 3330.530762 9619999078 43177.398438
1529 2022-01-16 3330.387207 3376.401123 3291.563721 3350.921875 3350.921875 9505934874 43113.878906
1530 2022-01-17 3350.947266 3355.819336 3157.224121 3212.304932 3212.304932 12344309617 42250.550781
1531 2022-01-18 3212.287598 3236.016113 3096.123535 3164.025146 3164.025146 13024154091 42375.632812
1532 2022-01-19 3163.054932 3170.838135 3055.951416 3123.905762 3123.905762 14121734144 42142.539062
1533 rows × 8 columns
I want to focus solely on the closing price of BTC and ETH:
X = data[['Close_y', 'Date']]
y = data['Close_x']
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 42, shuffle = False)
# grid search
X_train = X_train.drop('Date', axis = 1)
X_test = X_test.drop('Date', axis = 1)
Look for the best grid:
# Define the p, d and q parameters (restricted to 0 here; widen the ranges to search more combinations)
p = d = q = range(0, 1)
# Generate all different combinations of p, d and q triplets
pdq = list(itertools.product(p, d, q))
# Generate all different combinations of seasonal p, d and q triplets
# put 12 in the 's' position (note: the data here is daily, so 12 means a 12-day season)
pdqs = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
### Run Grid Search ###
def sarimax_gridsearch(pdq, pdqs, maxiter=5):
    ans = []
    for comb in pdq:
        for combs in pdqs:
            try:
                mod = SARIMAX(y_train, exog=X_train, order=comb, seasonal_order=combs)
                output = mod.fit(maxiter=maxiter)
                ans.append([comb, combs, output.bic])
                print('SARIMAX {} x {}12 : BIC Calculated ={}'.format(comb, combs, output.bic))
            except Exception:
                continue
    # Find the parameters with minimal BIC value
    # Convert into dataframe
    ans_df = pd.DataFrame(ans, columns=['pdq', 'pdqs', 'bic'])
    # Sort and return the top combination
    ans_df = ans_df.sort_values(by=['bic'], ascending=True)
    print(ans_df)
    ans_df = ans_df.iloc[0]
    return ans_df['pdq'], ans_df['pdqs']
o, s = sarimax_gridsearch(pdq, pdqs)
Make the predictions
# future predictions
# create Exogenous variables
df1 = df1.reset_index()
df1 = df1.set_index('Date')
df1 = df1.sort_index()
li = []
ys = ['Close']
for i in ys:
    a = df1[i]
    train_set, test_set = np.split(a, [int(.80 * len(a))])
    model = pm.auto_arima(train_set, stepwise=True, error_action='ignore', seasonal=True, m=7)
    b = model.get_params()
    order = b.get('order')
    s_order = b.get('seasonal_order')
    model = sm.tsa.statespace.SARIMAX(a,
                                      order=order,
                                      seasonal_order=s_order
                                      )
    model_fit = model.fit()
    start_index = data.index.max().date() + relativedelta(days=1)
    # note: day+10 can exceed the month length; relativedelta(days=10) would be safer
    end_index = date(start_index.year, start_index.month, start_index.day+10)
    forecast = model_fit.predict(start=start_index, end=end_index)
    #start_index = data.shape[0]
    #end_index = start_index + 12
    #forecast = model_fit.predict(start=start_index, end=end_index)
    li.append(forecast)
df = pd.DataFrame(li)
df = df.transpose()
df.columns = ys
df = df.reset_index()
exo = df[['Close', 'index']]
exo = exo.set_index('index')
But when I try to make the future predictions based on exo, like this:
#fit the model
print(b, s)
model_best = SARIMAX(y, exog=X.drop(['Date'], axis=1), order=o, seasonal_order=s)
model_fit = model_best.fit()
model_fit.summary()
model_fit.plot_diagnostics(figsize=(15,12))
start_index = data.shape[0]
end_index = start_index + 12
pred_uc = model_fit.forecast(steps=13, start_index = start_index, end_index = end_index, exog = exo)
future_df = pd.DataFrame({'pred' : pred_uc})
print('Forecast:')
print(future_df)
plt.rcParams["figure.figsize"] = (8, 5)
#data = data.set_index('time')
plt.plot(data['Close_x'],color = 'blue', label = 'Actual')
plt.plot(pred_uc, color = 'orange',label = 'Predicted')
plt.show()
I get this annoying error:
ValueError Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\tsa\statespace\mlemodel.py in _validate_out_of_sample_exog(self, exog, out_of_sample)
1757 try:
-> 1758 exog = exog.reshape(required_exog_shape)
1759 except ValueError:
ValueError: cannot reshape array of size 11 into shape (13,1)
ValueError: Provided exogenous values are not of the appropriate shape. Required (13, 1), got (11, 1).
Can someone explain where I am wrong or what steps I missed in this module?
Check the shape of the exo variable. If you are forecasting 13 steps, then you need to provide exog variables for each of those 13 steps. The error message is saying that you only provided exog variables for 11 steps. You can either provide a larger array to the exog argument, or you can change the forecast to be for 11 steps.
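A minimal sketch of the fix, assuming the exo frame built above holds the future exogenous values (note that SARIMAXResults.forecast takes steps and exog, not start_index/end_index):
# forecast exactly as many steps as there are future exogenous rows
steps = len(exo)
pred_uc = model_fit.forecast(steps=steps, exog=exo)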
I'm plotting the counts of a variable grouped by time as a heatmap. However, when including both hour and minute, the counts are quite low, so the resulting heatmap doesn't provide much insight. Is it possible to group the counts into bigger blocks of time? I'm hoping to test some different periods (5, 10 mins).
I'm also hoping to plot time on the x-axis. Similar to the output attached.
import seaborn as sns
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import timedelta
start = datetime(1900,1,1,10,0,0)
end = datetime(1900,1,1,13,0,0)
seconds = (end - start).total_seconds()
step = timedelta(minutes = 1)
array = []
for i in range(0, int(seconds), int(step.total_seconds())):
array.append(start + timedelta(seconds=i))
array = [i.strftime('%Y-%m-%d %H:%M:%S') for i in array]
df2 = pd.DataFrame(array).rename(columns = {0:'Time'})
df2['Count'] = np.random.uniform(0.0, 0.5, size = len(df2))
df2['Count'] = df2['Count'].round(1)
df2['Time'] = pd.to_datetime(df2['Time'])
df2['Hour'] = df2['Time'].dt.hour
df2['Min'] = df2['Time'].dt.minute
g = df2.groupby(['Hour','Min','Count'])
count_df = g['Count'].nunique().unstack()
count_df.fillna(0, inplace = True)
sns.heatmap(count_df)
To deal with such cases, I think it is easiest to downsample the data; this also makes it easy to change the bin size. The axis labels in the output graph will need to be adjusted, but I recommend this method.
import seaborn as sns
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import timedelta
start = datetime(1900,1,1,10,0,0)
end = datetime(1900,1,1,13,0,0)
seconds = (end - start).total_seconds()
step = timedelta(minutes = 1)
array = []
for i in range(0, int(seconds), int(step.total_seconds())):
array.append(start + timedelta(seconds=i))
array = [i.strftime('%Y-%m-%d %H:%M:%S') for i in array]
df2 = pd.DataFrame(array).rename(columns = {0:'Time'})
df2['Count'] = np.random.uniform(0.0, 0.5, size = len(df2))
df2['Count'] = df2['Count'].round(1)
df2['Time'] = pd.to_datetime(df2['Time'])
df2['Hour'] = df2['Time'].dt.hour
df2['Min'] = df2['Time'].dt.minute
df2.set_index('Time', inplace=True)
count_df = df2.resample('10min')['Count'].value_counts().unstack()
count_df.fillna(0, inplace = True)
sns.heatmap(count_df.T)
You could achieve this by creating a column of numbers in which each element repeats for the number of minutes you want to group by.
For example:
minutes = 3
x = [0,1,2]
np.repeat(x, repeats=minutes, axis=0)
# array([0, 0, 0, 1, 1, 1, 2, 2, 2])
and then group your data using this column.
So your code would look like:
...
minutes = 5
x = [i for i in range(int(df2.shape[0] / minutes))]
df2['group'] = np.repeat(x, repeats=minutes, axis=0)
g = df2.groupby(['group', 'Count'])
count_df = g['Count'].nunique().unstack()
count_df.fillna(0, inplace = True)
I have made 2 functions, one for the cumulative logarithmic returns and the other for the total relative return.
Cumulative logarithmic returns:
# Cumulative logarithmic returns function:
def tlog_r(data, start, end):
    tlog_return = copy.deepcopy(data)
    for t in range(0, len(tlog_return)):
        x = data[t]
        y = data[0]
        tlog_return[t] = x/y
    tlog_return = np.log(tlog_return)
    tlog_return[0] = 0
    return tlog_return
Total relative returns:
# Total relative returns function:
def tr_rel(data):
    tlog_return = copy.deepcopy(data)
    for t in range(0, len(tlog_return)):
        x = data[t]
        y = data[0]
        tlog_return[t] = x/y
    tlog_return = np.log(tlog_return)
    tlog_return[0] = 0
    tr_relative = copy.deepcopy(tlog_return)
    for t in range(0, len(tr_relative)):
        tr_relative[t] = 100*(np.exp(tr_relative[t])-1)
    print(tr_relative)
    return tr_relative
I want to calculate them from a dataframe of a stock between 2 dates.
Neither function raises an error, but if the dates don't start in 2000, 2005 or 2011, it returns a dataframe full of NaNs except for the value at index [0].
Why is this happening? How can I solve it?
In case you need it, this is the part of the code where I call the functions:
from relative_returns_functions import tlog_r, tr_rel
from pandas_datareader import data
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import copy
ticker='AAPL'
start_date='2000-01-01'
end_date='2019-12-31'
price='Close'
# Program
panel_data = data.DataReader(ticker , 'yahoo', start_date, end_date)[price]
title = '{} {} price'.format(ticker, price) #Plot title
panel_data.plot(title=title)
# Data processing
all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')
panel_data = panel_data.reindex(all_weekdays)
panel_data = panel_data.fillna(method='ffill')
# Plot
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10,6))
comp_title = '{} returns comparison'.format(ticker)
fig.suptitle(comp_title)
sum_log_returns = tlog_r(panel_data, start_date, end_date)
ax1.plot(sum_log_returns.index, sum_log_returns, label=ticker)
ax1.set_ylabel('Cumulative log returns')
ax1.legend(loc='best')
tot_logreturns = tr_rel(panel_data)
ax2.plot(tot_logreturns.index, tot_logreturns, label=ticker)
ax2.set_ylabel('Total relative returns (%)')
ax2.legend(loc='best')
plt.show()
Here is a minimal reproducible example; you will have to import the functions, pandas_datareader, pandas, numpy and copy.
ticker='AAPL'
start_date='2000-01-01'
end_date='2019-12-31'
price='Close'
panel_data = data.DataReader(ticker , 'yahoo', start_date, end_date)[price]
all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')
panel_data = panel_data.reindex(all_weekdays)
panel_data = panel_data.fillna(method='ffill')
sum_log_returns = tlog_r(panel_data, start_date, end_date)
print(sum_log_returns)
tot_logreturns = tr_rel(panel_data)
print(tot_logreturns)
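One likely culprit (an assumption, since the raw data isn't shown): reindexing to all business days inserts NaN rows for market holidays, and fillna(method='ffill') cannot fill a NaN in the very first row. Both functions divide every value by data[0], so a leading NaN would turn the entire result into NaNs - and whether the first business day of the chosen start year was a trading day would explain why only some start years work. A quick check and one possible fix:
# if the first reindexed value is NaN, every x / data[0] becomes NaN
print(panel_data.iloc[0])
# back-fill the leading gap after forward-filling the rest
panel_data = panel_data.fillna(method='ffill').fillna(method='bfill')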
I guess this is supposed to be simple... but I can't seem to make it work.
I have some stock data
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range(start = "06/01/2018", end = "08/01/2018"),
data = np.random.rand(62)*100)
I am doing some analysis on it, which results in my drawing some lines on the graph.
I want to plot a 45° line somewhere on the graph as a reference for the lines I drew.
What I have tried is
x = df.tail(len(df)//20).index  # integer division; len(df)/20 would raise a TypeError
x = x.reset_index()
x_first_val = df.loc[x.loc[0].date].adj_close
In order to get some point and then use slope = 1 and calculate y values.. but this sounds all wrong.
Any ideas?
Here is a possibility:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range(start = "06/01/2018", end = "08/01/2018"),
                  data=np.random.rand(62)*100,
                  columns=['data'])
# Get values for the time:
index_range = df.index[('2018-06-18' < df.index) & (df.index < '2018-07-21')]
# get the timestamps in nanoseconds (since epoch)
timestamps_ns = index_range.astype(np.int64)
# convert it to a relative number of days (for example, could be seconds)
time_day = (timestamps_ns - timestamps_ns[0]) / 1e9 / 60 / 60 / 24
# Define y-data for a line:
slope = 3 # unit: "something" per day
something = time_day * slope
trendline = pd.Series(something, index=index_range)
# Graph:
df.plot(label='data', alpha=0.8)
trendline.plot(label='some trend')
plt.legend(); plt.ylabel('something');
which gives a plot of the data with the trend line overlaid.
edit - first answer, using dayofyear instead of the timestamps:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range(start = "06/01/2018", end = "08/01/2018"),
                  data=np.random.rand(62)*100,
                  columns=['data'])
# Define data for a line:
slope = 3 # unit: "something" per day
index_range = df.index[('2018-06-18' < df.index) & (df.index < '2018-07-21')]
dayofyear = index_range.dayofyear # it will not work around the new year...
dayofyear = dayofyear - dayofyear[0]
something = dayofyear * slope
trendline = pd.Series(something, index=index_range)
# Graph:
df.plot(label='data', alpha=0.8)
trendline.plot(label='some trend')
plt.legend(); plt.ylabel('something');
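As a side note (not part of the original answers): on matplotlib 3.3+ there is ax.axline, which draws an infinite reference line from a point and a slope. With a date axis, the slope is in y-units per day, so it can serve as the reference line directly; a minimal sketch:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import numpy as np

df = pd.DataFrame(index=pd.date_range(start = "06/01/2018", end = "08/01/2018"),
                  data=np.random.rand(62)*100,
                  columns=['data'])

fig, ax = plt.subplots()
ax.plot(df.index, df['data'], alpha=0.8, label='data')
# date2num gives the anchor's x-position in matplotlib's day-based date units
x0 = mdates.date2num(df.index[10])
ax.axline((x0, df['data'].iloc[10]), slope=1, color='k', linestyle='--', label='slope 1')
ax.legend()
plt.show()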