I want to create a new frequency to assign to a pandas.DatetimeIndex. This is a dekad frequency, with 36 periods in a year (three per month). The first always falls on the 10th day of the month, the second on the 20th, and the third on the final day of that month.
The difficulty is that the final day of the month:
- differs in February depending on whether it's a leap year (28th or 29th)
- differs depending on the number of days in that month (28, 29, 30, 31)
Ultimately, however, it is a set frequency (3 per month, 36 periods per year).
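To make the rule above concrete, here is a minimal sketch that builds the dekad dates with only the standard library. The helper name dekad_dates is hypothetical (it is not part of pandas); it relies on calendar.monthrange to get the length of each month, which takes care of leap years.

```python
import calendar
from datetime import date


def dekad_dates(start_year, end_year):
    """Return every dekad date from start_year through end_year (inclusive)."""
    out = []
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            # monthrange returns (weekday of the 1st, number of days in the month)
            last_day = calendar.monthrange(year, month)[1]
            out.extend(date(year, month, day) for day in (10, 20, last_day))
    return out


dekads_2000 = dekad_dates(2000, 2000)
print(len(dekads_2000))    # 36 dekads in the year
print(dekads_2000[3:6])    # the three February dekads, ending on the 29th in 2000
```

The resulting list can be passed to pd.DatetimeIndex if a pandas index is needed.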
The reason is that statsmodels.tsa.holtwinters models require indexes with a given frequency to make forecasts. When I try to run the holtwinters forecast I get the following warning message:
/home/tommy/miniconda3/envs/ml/lib/python3.8/site-packages/statsmodels/tsa/base/tsa_model.py:216: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
This is what dekad timesteps look like:
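import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd
dates = pd.date_range("2000-01-01", "2003-01-01")
# the 10th and 20th of every month
_dekads = [d for d in dates if d.day in [10, 20]]
# the last day of every month (roll each 10th forward to its month end)
_month_ends = [d + MonthEnd(1) for d in dates if d.day == 10]
dekads = sorted(np.concatenate([_dekads, _month_ends]))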
I want to be able to assign a dekad frequency to the index
df = pd.DataFrame({"y": np.random.random(len(dekads))}, index=dekads)
df.head()
Out[]:
y
2000-01-10 0.013236
2000-01-20 0.430563
2000-01-31 0.028183
2000-02-10 0.050080
2000-02-20 0.092100
I'd like to be able to assign a "dekad" frequency to the object. How can I create my own dekad frequency?
df.index.freq = "dekad"
Out[]:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
pandas/_libs/tslibs/offsets.pyx in pandas._libs.tslibs.offsets._get_offset()
KeyError: 'DEKAD'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
pandas/_libs/tslibs/offsets.pyx in pandas._libs.tslibs.offsets.to_offset()
pandas/_libs/tslibs/offsets.pyx in pandas._libs.tslibs.offsets._get_offset()
ValueError: Invalid frequency: DEKAD
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-155-aa7b4737fd5a> in <module>
7
8 df = pd.DataFrame({"y": np.random.random(len(dekads))}, index=dekads)
----> 9 df.index.freq = "dekad"
~/miniconda3/envs/ml/lib/python3.8/site-packages/pandas/core/indexes/extension.py in fset(self, value)
62
63 def fset(self, value):
---> 64 setattr(self._data, name, value)
65
66 fget.__name__ = name
~/miniconda3/envs/ml/lib/python3.8/site-packages/pandas/core/arrays/datetimelike.py in freq(self, value)
1090 def freq(self, value):
1091 if value is not None:
-> 1092 value = to_offset(value)
1093 self._validate_frequency(self, value)
1094
pandas/_libs/tslibs/offsets.pyx in pandas._libs.tslibs.offsets.to_offset()
pandas/_libs/tslibs/offsets.pyx in pandas._libs.tslibs.offsets.to_offset()
ValueError: Invalid frequency: dekad
# How can I create a new freq object in pandas
The purpose of this exercise:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv(
"https://gist.githubusercontent.com/tommylees112/2b1b2dda43d91ea9346a6edaa6788ec8/raw/644af74955ce078d1c4d55a2ffd6a55eeb59bad4/demo_data_SO_02092021.csv"
).astype({"time": "datetime64[ns]"}).set_index("time")
train, test = df.iloc[:-100], df.iloc[-100:]
f, ax = plt.subplots(figsize=(12, 4))
ax.plot(train, label="train")
ax.plot(test, label="test")
plt.xticks(rotation=70)
plt.legend()
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing
# set seasonality parameters
m = 36
alpha = 1/(2*m)
model = ExponentialSmoothing(train["vci"],trend="mul").fit()
preds = model.forecast(len(test))
preds.index = test.index
f, ax = plt.subplots(figsize=(12, 4))
ax.plot(train.index, model.fittedvalues, label="Train Predictions")
ax.plot(test.index, preds, label="Test Predictions")
ax.plot(df.index, df["vci"], ls="--", color="k", alpha=0.6)
plt.xticks(rotation=70)
plt.legend()
This forecast is clearly poor and does not reflect the learned seasonality. I believe this is an issue with the fact that no frequency has been assigned to the datetime index.
If there are alternative methods for achieving these goals, I would be very keen to explore them.
Related
I have the following dataframe called 'data':
Month        Revenue Index
1920-01-01   1.72
1920-02-01   1.83
1920-03-01   1.94
...          ...
2021-10-01   114.20
2021-11-01   115.94
2021-12-01   116.01
This is essentially a monthly revenue index on which I am trying to use seasonal_decompose with the following code:
result = seasonal_decompose(data['Revenue Index'], model='multiplicative')
But unfortunately I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-39-08e3139bbf77> in <module>()
----> 1 result = seasonal_decompose(data['Consumptieprijsindex'], model='multiplicative')
2 rcParams['figure.figsize'] = 12, 6
3 plt.rc('lines', linewidth=1, color='r')
4
5 fig = result.plot()
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/seasonal.py in seasonal_decompose(x, model, filt, freq, two_sided, extrapolate_trend)
125 freq = pfreq
126 else:
--> 127 raise ValueError("You must specify a freq or x must be a "
128 "pandas object with a timeseries index with "
129 "a freq not set to None")
ValueError: You must specify a freq or x must be a pandas object with a timeseries index with a freq not set to None
Does anyone know how to solve this issue? Thanks!
The following code in the comments answered my question:
result = seasonal_decompose(data['Revenue Index'], model='multiplicative', period=12)
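Passing period=12 tells the function the season length directly. The other route the error message hints at is to give the index a frequency. A minimal sketch, assuming the dates are all month starts with no gaps; the three-row frame is a made-up stand-in for the real data.

```python
import pandas as pd

# A DatetimeIndex built from a list of strings carries no frequency.
raw = pd.DataFrame(
    {"Revenue Index": [1.72, 1.83, 1.94]},
    index=pd.to_datetime(["1920-01-01", "1920-02-01", "1920-03-01"]),
)
print(raw.index.freq)          # None

# asfreq conforms the index to the given offset alias ("MS" = month start).
monthly = raw.asfreq("MS")
print(monthly.index.freqstr)   # MS
```

Be aware that asfreq inserts rows of NaN for any month that is missing from the original index, so it is worth checking for gaps first.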
I am trying to obtain the MACD, MACD signal and MACD difference lines for stock prices given certain inputs. Below is the custom code that I am using.
def create_MACD(long_term, short_term, dataframe, signal_ema_length):
    # obtain the SMA data that we need to obtain the MACD ema values
    short_sma = create_sma(short_term, dataframe)
    long_sma = create_sma(long_term, dataframe)
    # create the EMAs that will be subtracted to obtain the MACD line
    short_ema = create_ema(short_term, 2, dataframe)
    long_ema = create_ema(long_term, 2, dataframe)
    # calculate length of MACD array and starting indices for line and signal
    length = len(dataframe)
    # calculate the starting index of the line
    start_line = long_term
    # calculate the starting index of the signal line
    start_signal = long_term + signal_ema_length
    # create the smoothing variables for the signal line
    smoothing = 2 / (signal_ema_length + 1)
    smoothing_minus = 1 - smoothing
    # calculate number of iterations for macd and macd signal
    num_iters_macd = len(dataframe) - long_term
    num_iters_signal = num_iters_macd - signal_ema_length
    # create the MACD arrays; change dataframe to array for iterations
    macd = np.zeros(length)
    macd_signal = np.zeros(length)
    array = dataframe.to_numpy()
    # for loop for MACD data
    for i in range(num_iters_macd):
        index = start_line + i
        macd[index] = short_ema[index] - long_ema[index]
    # for loop for MACD signal
    for i in range(num_iters_signal):
        index = start_signal + i
        macd_signal[index] = macd[index] * smoothing + macd_signal[index - 1] * smoothing_minus
    # create sma of first X days of MACD
    sma_MACD = sum(macd[:signal_ema_length]) / signal_ema_length
    # insert the first value into the MACD signal array
    macd_signal[start_signal - 1] = macd[start_signal - 1] * smoothing + sma_MACD * smoothing_minus
    # create array for MACD difference
    macd_diff = np.zeros(length)
    # create starting index for MACD difference
    start_diff = start_signal
    num_iters_diff = num_iters_signal
    for i in range(num_iters_diff):
        index = i + start_diff
        macd_diff[index] = macd[index] - macd_signal[index]
    # send all arrays to pandas dataframes
    MACD_line = pd.DataFrame(data=macd)
    MACD_signal = pd.DataFrame(data=macd_signal)
    MACD_difference = pd.DataFrame(data=macd_diff)
    return MACD_line, MACD_signal, MACD_difference
macd_av,signal_av,diff_av = create_MACD(26,12,price,9)
The error that I get is
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/opt/anaconda3/envs/tensorflow/lib/python3.7/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
354 try:
--> 355 return self._range.index(new_key)
356 except ValueError as err:
ValueError: 26 is not in range
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-20-a1e7f9a89bbb> in <module>
----> 1 macd_av,signal_av,diff_av = create_MACD(26,12,price,9)
<ipython-input-19-78834be35c60> in create_MACD(long_term, short_term, dataframe, signal_ema_length)
35 for i in range(num_iters_macd):
36 index = start_line+i
---> 37 macd[index] = short_ema[index]-long_ema[index]
38
39 #for loop for MACD signal
~/opt/anaconda3/envs/tensorflow/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
2900 if self.columns.nlevels > 1:
2901 return self._getitem_multilevel(key)
-> 2902 indexer = self.columns.get_loc(key)
2903 if is_integer(indexer):
2904 indexer = [indexer]
~/opt/anaconda3/envs/tensorflow/lib/python3.7/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
355 return self._range.index(new_key)
356 except ValueError as err:
--> 357 raise KeyError(key) from err
358 raise KeyError(key)
359 return super().get_loc(key, method=method, tolerance=tolerance)
KeyError: 26
I have tested the custom SMA and EMA functions, so those are outputting the correct arrays. I know that this error means that my for loop range is not correct, but I am unsure why it is wrong.
The problem seems to be that the short and long EMA/SMA arrays created at the beginning are pandas DataFrames. Indexing a DataFrame with a plain integer looks up a column label, not a row position, so to index by position you need .iloc. That is awkward inside loops, so it is simpler to convert them to NumPy arrays first, after which the loop should work as intended.
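A small illustration of the difference, assuming the EMA helper returns a single-column DataFrame with default column labels (the frame below is a made-up stand-in, not the output of create_ema):

```python
import numpy as np
import pandas as pd

# Stand-in for an EMA result: 30 rows, one column labelled 0.
ema = pd.DataFrame(np.arange(30.0))

# Square brackets on a DataFrame select a column by label, so 26 is not found.
try:
    ema[26]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True
print(label_lookup_failed)        # True

# Positional access works with .iloc ...
print(ema.iloc[26, 0])            # 26.0

# ... or by converting to a flat NumPy array before looping.
values = ema.to_numpy().ravel()
print(values[26])                 # 26.0
```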
I am trying to calculate the Sharpe ratio with a set of stock symbols. The code works with the first 5 stock symbols, however, it stops working after 6 symbols.
I searched the document for dimension errors that could possibly be the ValueError message but I do not see any possibilities. I also searched Quandl and Google for the error I was getting but could not get a specific result.
If someone could please let me know what I am doing wrong that would be great. I am very new to coding.
# import needed modules
import quandl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# get adjusted closing prices of 5 selected companies with Quandl
quandl.ApiConfig.api_key = 'oskr4yzppjZwxgJ7zNra'
selected = ['TGT', 'AAPL', 'MSFT', 'FCN', 'TSLA', 'SPY', 'XLV', 'BRK.B', 'WMT', 'JPM']
data = quandl.get_table('WIKI/PRICES', ticker = selected,
qopts = { 'columns': ['date', 'ticker', 'adj_close'] },
date = { 'gte': '2009-1-1', 'lte': '2019-12-31'}, paginate=True)
# reorganize data pulled by setting date as index with
# columns of tickers and their corresponding adjusted prices
clean = data.set_index('date')
table = clean.pivot(columns='ticker')
# calculate daily and annual returns of the stocks
returns_daily = table.pct_change()
returns_annual = returns_daily.mean() * 250
# get daily and covariance of returns of the stock
cov_daily = returns_daily.cov()
cov_annual = cov_daily * 250
# empty lists to store returns, volatility and weights of imaginary portfolios
port_returns = []
port_volatility = []
sharpe_ratio = []
stock_weights = []
# set the number of combinations for imaginary portfolios
num_assets = len(selected)
num_portfolios = 50000
# set random seed for reproduction's sake
np.random.seed(101)
# populate the empty lists with each portfolios returns,risk and weights
for single_portfolio in range(num_portfolios):
    weights = np.random.random(num_assets)
    weights /= np.sum(weights)
    returns = np.dot(weights, returns_annual)
    volatility = np.sqrt(np.dot(weights.T, np.dot(cov_annual, weights)))
    sharpe = returns / volatility
    sharpe_ratio.append(sharpe)
    port_returns.append(returns)
    port_volatility.append(volatility)
    stock_weights.append(weights)
# a dictionary for Returns and Risk values of each portfolio
portfolio = {'Returns': port_returns,
'Volatility': port_volatility,
'Sharpe Ratio': sharpe_ratio}
# extend original dictionary to accommodate each ticker and weight in the portfolio
for counter, symbol in enumerate(selected):
    portfolio[symbol + ' weight'] = [weight[counter] for weight in stock_weights]
# make a nice dataframe of the extended dictionary
df = pd.DataFrame(portfolio)
# get better labels for desired arrangement of columns
column_order = ['Returns', 'Volatility', 'Sharpe Ratio'] + [stock+' weight' for stock in selected]
# reorder dataframe columns
df = df[column_order]
# find min Volatility & max sharpe values in the dataframe (df)
min_volatility = df['Volatility'].min()
max_sharpe = df['Sharpe Ratio'].max()
# use the min, max values to locate and create the two special portfolios
sharpe_portfolio = df.loc[df['Sharpe Ratio'] == max_sharpe]
min_variance_port = df.loc[df['Volatility'] == min_volatility]
# plot the efficient frontier with a scatter plot
plt.style.use('seaborn-dark')
df.plot.scatter(x='Volatility', y='Returns', c='Sharpe Ratio',
cmap='RdYlGn', edgecolors='black', figsize=(10, 8), grid=True)
plt.scatter(x=sharpe_portfolio['Volatility'], y=sharpe_portfolio['Returns'], c='red', marker='D', s=200)
plt.scatter(x=min_variance_port['Volatility'], y=min_variance_port['Returns'], c='blue', marker='D', s=200)
plt.xlabel('Volatility (Std. Deviation)')
plt.ylabel('Expected Returns')
plt.title('Efficient Frontier')
plt.show()
# print the details of the 2 special portfolios
print(min_variance_port.T)
print(sharpe_portfolio.T)
The error I am getting is this:
ValueError Traceback (most recent call last)
<ipython-input-8-3e66668bf017> in <module>
42 weights = np.random.random(num_assets)
43 weights /= np.sum(weights)
---> 44 returns = np.dot(weights, returns_annual)
45 volatility = np.sqrt(np.dot(weights.T, np.dot(cov_annual, weights)))
46 sharpe = returns / volatility
ValueError: shapes (10,) and (7,) not aligned: 10 (dim 0) != 7 (dim 0)
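The shapes in the error message suggest that only 7 of the 10 requested tickers came back from the query, while the weights are still sized from len(selected). A minimal sketch of one way around that, assuming some tickers are simply absent from the data source: size the weights from the columns that actually exist. The ticker list and the synthetic prices below are hypothetical stand-ins, not Quandl data.

```python
import numpy as np
import pandas as pd

selected = ["TGT", "AAPL", "MSFT", "SPY"]   # tickers that were requested

# Hypothetical result of the query: "SPY" is missing from the price table.
rng = np.random.default_rng(101)
table = pd.DataFrame(
    100 + rng.normal(0, 1, size=(250, 3)).cumsum(axis=0),
    columns=["TGT", "AAPL", "MSFT"],
)

# Report which requested tickers have no data.
missing = sorted(set(selected) - set(table.columns))
print("tickers with no data:", missing)

returns_daily = table.pct_change()
returns_annual = returns_daily.mean() * 250
cov_annual = returns_daily.cov() * 250

# Size the weights from the data, not from the request list.
num_assets = table.shape[1]
weights = rng.random(num_assets)
weights /= weights.sum()

port_return = np.dot(weights, returns_annual)
port_volatility = np.sqrt(weights @ cov_annual.to_numpy() @ weights)
print(port_return, port_volatility)
```

In the original script the pivoted table has two-level column labels, so the comparison would have to use the ticker level of the columns, and the later loop that names the weight columns would need the same filtered list.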
I am using playerStats.csv, which includes 8 columns, of which I only need 2. So I'm trying to create a new DataFrame with only those 2 columns.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataset = pd.read_csv("HLTVData/playerStats.csv")
dataset.head(20)
I only need the ADR and the Rating.
So I first create a matrix with the data set.
mat = dataset.as_matrix()
#4 is the ADR and 6 is the Rating
newDataSet = pd.DataFrame(dataset, index=indexMatrix,columns=(mat[:,4],mat[:,6]) )
But it didn't work; it threw an exception:
NameError Traceback (most recent call last)
<ipython-input-10-1f975cc2514a> in <module>()
1 #4 is the ADR and 6 is the Rating
----> 2 newDataSet = pd.DataFrame(dataset, index=indexMatrix,columns=(mat[:,4],mat[:,6]) )
NameError: name 'indexMatrix' is not defined
I also tried using the dataset.
newDataSet = pd.DataFrame(dataset, index=np.array(range(dataset.shape[0])), columns=dataset['ADR'])
/home/tensor/miniconda3/envs/tensorflow35openvc/lib/python3.5/site-packages/pandas/core/internals.py in _make_na_block(self, placement, fill_value)
3984
3985 dtype, fill_value = infer_dtype_from_scalar(fill_value)
-> 3986 block_values = np.empty(block_shape, dtype=dtype)
3987 block_values.fill(fill_value)
3988 return make_block(block_values, placement=placement)
MemoryError:
I think you need the usecols parameter of read_csv:
dataset = pd.read_csv("HLTVData/playerStats.csv", usecols=['ADR','Rating'])
Or:
dataset = pd.read_csv("HLTVData/playerStats.csv", usecols=[4,6])
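If the full file is already loaded, the same two columns can also be picked out afterwards by name. A short sketch using an in-memory CSV; the Name column and the values are invented for illustration, only ADR and Rating come from the question.

```python
import io

import pandas as pd

# Tiny stand-in for playerStats.csv.
csv_text = "Name,ADR,Rating\nplayerA,85.1,1.20\nplayerB,70.4,0.98\n"

# Option 1: load everything, then select the columns with a list of labels.
full = pd.read_csv(io.StringIO(csv_text))
subset = full[["ADR", "Rating"]].copy()

# Option 2: read only the wanted columns in the first place.
direct = pd.read_csv(io.StringIO(csv_text), usecols=["ADR", "Rating"])

print(subset.equals(direct))   # True
```

usecols avoids loading the unwanted columns at all, which matters more for large files.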
I am trying to draw some basic plots using seaborn's jointplot() function.
My pandas data frame looks like this:
Out[250]:
YEAR Yields avgSumPcpn avgMaxSumTemp avgMinSumTemp
1970 5000 133.924981 30.437124 19.026974
1971 5560 107.691316 31.161974 19.278186
1972 5196 116.830066 31.454192 19.443712
1973 4233 181.550733 30.373581 19.097679
1975 5093 112.137538 30.428966 18.863224
I am trying to draw 'Yields' against 'YEAR' (So a plot to see how 'Yields' is varying over time). A simple plot.
But when I do this:
sns.jointplot(x='YEAR',y='Yeilds', data = summer_pcpn_temp_yeild, kind = 'reg', size = 10)
I am getting the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-251-587582a746b8> in <module>()
3 #ax = plt.axes()
4 #sns_sum_reg_min_temp_pcpn = sns.regplot(x='avgSumPcpn',y='avgMaxSumTemp', data = df_sum_temp_pcpn)
----> 5 sns.jointplot(x='Yeilds',y='YEAR', data = summer_pcpn_temp_yeild, kind = 'reg', size = 10)
6 plt.title('Avg Summer Precipitation vs Yields of Wharton TX', fontsize = 10)
7
//anaconda/lib/python2.7/site-packages/seaborn/distributions.pyc in jointplot(x, y, data, kind, stat_func, color, size, ratio, space, dropna, xlim, ylim, joint_kws, marginal_kws, annot_kws, **kwargs)
793 grid = JointGrid(x, y, data, dropna=dropna,
794 size=size, ratio=ratio, space=space,
--> 795 xlim=xlim, ylim=ylim)
796
797 # Plot the data using the grid
//anaconda/lib/python2.7/site-packages/seaborn/axisgrid.pyc in __init__(self, x, y, data, size, ratio, space, dropna, xlim, ylim)
1637 if dropna:
1638 not_na = pd.notnull(x) & pd.notnull(y)
-> 1639 x = x[not_na]
1640 y = y[not_na]
1641
TypeError: string indices must be integers, not Series
So I printed out the types of each column. Here is how:
for i in summer_pcpn_temp_yeild.columns.values.tolist():
    print type(summer_pcpn_temp_yeild[[i]])
print type(summer_pcpn_temp_yeild.index.values)
which gives me:
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<type 'numpy.ndarray'>
So I am unable to work out how to fix it.
Any help would be greatly appreciated.
Thanks
Check that the YEAR and Yields columns hold integer (not string) values.
Try changing 'Yeilds' to 'Yields' in your call to jointplot:
sns.jointplot(x='YEAR',y='Yields', data = summer_pcpn_temp_yeild, kind = 'reg', size = 10)
The error message is misleading. Seaborn can't find the column named "Yeilds" in your summer_pcpn_temp_yeild dataframe, because the dataframe column is spelled "Yields".
I had the same problem, and fixed it by correcting the x= argument to sns.jointplot()
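Since the error message does not mention the missing column, it can help to check the names before plotting. A minimal sketch; check_columns is a hypothetical helper (not part of seaborn or pandas) and the two-row frame is a stand-in for the real data.

```python
import difflib

import pandas as pd

df = pd.DataFrame({"YEAR": [1970, 1971], "Yields": [5000, 5560]})


def check_columns(frame, names):
    """Raise a readable KeyError if any requested column is absent."""
    for name in names:
        if name not in frame.columns:
            # Suggest the closest existing column names, if any.
            hints = difflib.get_close_matches(name, list(frame.columns))
            raise KeyError(f"column {name!r} not found; did you mean {hints}?")


check_columns(df, ["YEAR", "Yields"])   # passes silently

try:
    check_columns(df, ["YEAR", "Yeilds"])
except KeyError as err:
    message = str(err)
    print(message)
```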