computing PCA in pandas using an expanding window - python

I have the following DataFrame of market data:
DP PE BM CAPE
date
1990-01-31 0.0345 13.7235 0.503474 6.460694
1990-02-01 0.0346 13.6861 0.504719 6.396440
1990-02-02 0.0343 13.7707 0.501329 6.440094
1990-02-05 0.0342 13.7676 0.500350 6.460417
1990-02-06 0.0344 13.6814 0.503550 6.419991
... ... ... ... ...
2015-04-28 0.0201 18.7347 0.346717 26.741581
2015-04-29 0.0202 18.6630 0.348080 26.637641
2015-04-30 0.0205 18.4793 0.351642 26.363959
2015-05-01 0.0204 18.6794 0.347814 26.620701
2015-05-04 0.0203 18.7261 0.346813 26.695087
For every day in this timeseries, I want to compute the largest PCA component using a backward-looking expanding window. The following code gives me the DataFrame above:
def get_PCAprice_daily(start_date='1990-06-08', end_date='2015-09-30'):
    start_date = pd.to_datetime(start_date, yearfirst=True) - pd.DateOffset(years=1)
    end_date = pd.to_datetime(end_date, yearfirst=True)
    if start_date > end_date:
        print("Invalid date range provided")
        return 1
    dp = get_DP_daily().reset_index()
    pe = get_PE_daily().reset_index()
    bm = get_BM_daily().reset_index()
    cape = get_CAPE_daily().reset_index()
    variables = [pe, bm, cape]
    for var in variables:
        dp = dp.merge(var, how='left', on='date')
    df = dp.set_index('date')
    df = df.loc[start_date:end_date].dropna()
    return df
I've tried several different ways myself, but none of them give me access to the PCA's eigenvalues and eigenvectors, which I need in order to follow the approach described in this post for removing noise by keeping the eigenvector signs consistent. Below is a graph of what my current PCA values look like, and the sign-switching is a very big issue:
My incorrect PCA computation code:
window = 252 * 5

# Initialize an empty df of appropriate size for the output
df_pca = pd.DataFrame(np.zeros((df.shape[0] - window + 1, df.shape[1])))

# Define PCA fit-transform function
# Note: Instead of attempting to return the result,
# it is written into the previously created output array.
def rolling_pca(window_data):
    pca = PCA()
    transf = pca.fit_transform(df.iloc[window_data])
    df_pca.iloc[int(window_data[0])] = transf[0, :]
    return True

# Create a df containing row indices for the workaround
df_idx = pd.DataFrame(np.arange(df.shape[0]))
# Use `rolling` to apply the PCA function
_ = df_idx.rolling(window).apply(rolling_pca)

df = df.reset_index()
df = df.join(pd.DataFrame(df_pca[0]))
df.rename(columns={0: 'PCAprice'}, inplace=True)
df['PCAprice'] = df['PCAprice'].shift(window)
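For reference, a minimal sketch of what the expanding-window computation with sign alignment could look like, assuming df is the four-column frame returned by get_PCAprice_daily. sklearn's PCA exposes the eigenvectors as pca.components_ and the eigenvalues as pca.explained_variance_, so each window's leading eigenvector can be sign-aligned with the previous window's via their dot product:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

min_window = 252 * 5                          # warm-up before the first estimate
pca_vals = pd.Series(np.nan, index=df.index)
prev_vec = None
for i in range(min_window, len(df) + 1):
    window = df.iloc[:i]                      # expanding: all data up to day i
    pca = PCA(n_components=1)
    scores = pca.fit_transform(window)
    vec = pca.components_[0]                  # eigenvector of the largest component
    # flip the sign whenever the eigenvector points away from the previous window's
    if prev_vec is not None and np.dot(vec, prev_vec) < 0:
        vec, scores = -vec, -scores
    prev_vec = vec
    pca_vals.iloc[i - 1] = scores[-1, 0]      # that day's first-component score
df['PCAprice'] = pca_vals

Note that refitting on every expanding window is quadratic overall, and in practice you would likely standardize the columns within each window first, since the four series live on very different scales.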

Related

Sarimax endogenous and exogenous variables - Provided exogenous values are not of the appropriate shape

The issue that I have is with a rather simple approach to forecasting time series in Python using a SARIMAX model and 2 variables:
endogenous: the one of interest.
exogenous: the one assumed to have some influence on the endogenous variable.
The example uses the daily values of BTC and ETH, where ETH is endogenous and BTC is exogenous.
import datetime
import numpy as np
import matplotlib.pyplot as plt
import math
import pandas as pd
import pmdarima as pm
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from datetime import date
from math import sqrt
from dateutil.relativedelta import relativedelta
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from statsmodels.tsa.statespace.sarimax import SARIMAX
import itertools
from random import random
import yfinance as yf
plt.style.use('ggplot')
Fetching the data is quite simple, using the Yahoo Finance API via yfinance (imported as yf):
today = datetime.datetime.today()
ticker = input('Enter your ticker: ')
df1 = yf.download(ticker, period = 'max', interval = '1d')
df1.reset_index(inplace = True)
df1
This needs to be done manually: insert the name of the coin by hand, which gives the user more freedom in terms of what coins are combined.
Enter your ticker: BTC-USD
[*********************100%***********************] 1 of 1 completed
Date Open High Low Close Adj Close Volume
0 2014-09-17 465.864014 468.174011 452.421997 457.334015 457.334015 21056800
1 2014-09-18 456.859985 456.859985 413.104004 424.440002 424.440002 34483200
2 2014-09-19 424.102997 427.834991 384.532013 394.795990 394.795990 37919700
3 2014-09-20 394.673004 423.295990 389.882996 408.903992 408.903992 36863600
4 2014-09-21 408.084991 412.425995 393.181000 398.821014 398.821014 26580100
... ... ... ... ... ... ... ...
2677 2022-01-15 43101.898438 43724.671875 42669.035156 43177.398438 43177.398438 18371348298
2678 2022-01-16 43172.039062 43436.808594 42691.023438 43113.878906 43113.878906 17902097845
2679 2022-01-17 43118.121094 43179.390625 41680.320312 42250.550781 42250.550781 21690904261
2680 2022-01-18 42250.074219 42534.402344 41392.214844 42375.632812 42375.632812 22417209227
2681 2022-01-19 42365.046875 42462.070312 41248.902344 42142.539062 42142.539062 24763551744
2682 rows × 7 columns
So df1 is our exogenous data. Then the endogenous data are fetched in the same manner.
today = datetime.datetime.today()
ticker = input('Enter your ticker: ')
df2 = yf.download(ticker, period = 'max', interval = '1d')
df2.reset_index(inplace = True)
df2
Enter your ticker: ETH-USD
[*********************100%***********************] 1 of 1 completed
Date Open High Low Close Adj Close Volume
0 2017-11-09 308.644989 329.451996 307.056000 320.884003 320.884003 893249984
1 2017-11-10 320.670990 324.717987 294.541992 299.252991 299.252991 885985984
2 2017-11-11 298.585999 319.453003 298.191986 314.681000 314.681000 842300992
3 2017-11-12 314.690002 319.153015 298.513000 307.907990 307.907990 1613479936
4 2017-11-13 307.024994 328.415009 307.024994 316.716003 316.716003 1041889984
... ... ... ... ... ... ... ...
1528 2022-01-15 3309.844238 3364.537842 3278.670898 3330.530762 3330.530762 9619999078
1529 2022-01-16 3330.387207 3376.401123 3291.563721 3350.921875 3350.921875 9505934874
1530 2022-01-17 3350.947266 3355.819336 3157.224121 3212.304932 3212.304932 12344309617
1531 2022-01-18 3212.287598 3236.016113 3096.123535 3164.025146 3164.025146 13024154091
1532 2022-01-19 3163.054932 3170.838135 3055.951416 3123.905762 3123.905762 14121734144
1533 rows × 7 columns
Next comes a merging step where the two datasets are aligned:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
data = df2.merge(df1, on = 'Date', how = 'left')
which looks like this:
Date Open High Low Close_x Adj Close Volume Close_y
0 2017-11-09 308.644989 329.451996 307.056000 320.884003 320.884003 893249984 7143.580078
1 2017-11-10 320.670990 324.717987 294.541992 299.252991 299.252991 885985984 6618.140137
2 2017-11-11 298.585999 319.453003 298.191986 314.681000 314.681000 842300992 6357.600098
3 2017-11-12 314.690002 319.153015 298.513000 307.907990 307.907990 1613479936 5950.069824
4 2017-11-13 307.024994 328.415009 307.024994 316.716003 316.716003 1041889984 6559.490234
... ... ... ... ... ... ... ... ...
1528 2022-01-15 3309.844238 3364.537842 3278.670898 3330.530762 3330.530762 9619999078 43177.398438
1529 2022-01-16 3330.387207 3376.401123 3291.563721 3350.921875 3350.921875 9505934874 43113.878906
1530 2022-01-17 3350.947266 3355.819336 3157.224121 3212.304932 3212.304932 12344309617 42250.550781
1531 2022-01-18 3212.287598 3236.016113 3096.123535 3164.025146 3164.025146 13024154091 42375.632812
1532 2022-01-19 3163.054932 3170.838135 3055.951416 3123.905762 3123.905762 14121734144 42142.539062
1533 rows × 8 columns
I want to focus solely on the closing price of BTC and ETH:
X = data[['Close_y', 'Date']]
y = data['Close_x']
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 42, shuffle = False)
# grid search
X_train = X_train.drop('Date', axis = 1)
X_test = X_test.drop('Date', axis = 1)
Look for the best grid:
# Define the p, d and q parameters to take any value between 0 and 1 (exclusive)
p = d = q = range(0, 1)
# Generate all different combinations of p, d and q triplets
pdq = list(itertools.product(p, d, q))
# Generate all different combinations of seasonal p, d and q triplets
# put 12 in the 's' position as we have monthly data
pdqs = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
### Run Grid Search ###
def sarimax_gridsearch(pdq, pdqs, maxiter=5):
    ans = []
    for comb in pdq:
        for combs in pdqs:
            try:
                mod = SARIMAX(y_train, exog=X_train, order=comb, seasonal_order=combs)
                output = mod.fit(maxiter=maxiter)
                ans.append([comb, combs, output.bic])
                print('SARIMAX {} x {}12 : BIC Calculated ={}'.format(comb, combs, output.bic))
            except:
                continue

    # Find the parameters with minimal BIC value
    # Convert into dataframe
    ans_df = pd.DataFrame(ans, columns=['pdq', 'pdqs', 'bic'])

    # Sort and return top combination
    ans_df = ans_df.sort_values(by=['bic'], ascending=True)
    print(ans_df)
    ans_df = ans_df.iloc[0]
    return ans_df['pdq'], ans_df['pdqs']
o, s = sarimax_gridsearch(pdq, pdqs)
Make the predictions
# future predictions
# create exogenous variables
df1 = df1.reset_index()
df1 = df1.set_index('Date')
df1 = df1.sort_index()
li = []
ys = ['Close']
for i in ys:
    a = df1[i]
    train_set, test_set = np.split(a, [int(.80 * len(a))])
    model = pm.auto_arima(train_set, stepwise=True, error_action='ignore', seasonal=True, m=7)
    b = model.get_params()
    order = b.get('order')
    s_order = b.get('seasonal_order')
    model = sm.tsa.statespace.SARIMAX(a,
                                      order=order,
                                      seasonal_order=s_order)
    model_fit = model.fit()
    start_index = data.index.max().date() + relativedelta(days=1)
    end_index = start_index + relativedelta(days=10)  # 11 days in total, inclusive of the start
    forecast = model_fit.predict(start=start_index, end=end_index)
    # start_index = data.shape[0]
    # end_index = start_index + 12
    # forecast = model_fit.predict(start=start_index, end=end_index)
    li.append(forecast)
df = pd.DataFrame(li)
df = df.transpose()
df.columns = ys
df = df.reset_index()
exo = df[['Close', 'index']]
exo = exo.set_index('index')
But when I try to make the future predictions based on exo, like this:
#fit the model
print(b, s)
model_best = SARIMAX(y, exog=X.drop(['Date'], axis=1), order=o, seasonal_order=s)
model_fit = model_best.fit()
model_fit.summary()
model_fit.plot_diagnostics(figsize=(15,12))
start_index = data.shape[0]
end_index = start_index + 12
pred_uc = model_fit.forecast(steps=13, start_index = start_index, end_index = end_index, exog = exo)
future_df = pd.DataFrame({'pred' : pred_uc})
print('Forecast:')
print(future_df)
plt.rcParams["figure.figsize"] = (8, 5)
#data = data.set_index('time')
plt.plot(data['Close_x'],color = 'blue', label = 'Actual')
plt.plot(pred_uc, color = 'orange',label = 'Predicted')
plt.show()
I get this annoying error:
ValueError Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\tsa\statespace\mlemodel.py in _validate_out_of_sample_exog(self, exog, out_of_sample)
1757 try:
-> 1758 exog = exog.reshape(required_exog_shape)
1759 except ValueError:
ValueError: cannot reshape array of size 11 into shape (13,1)
ValueError: Provided exogenous values are not of the appropriate shape. Required (13, 1), got (11, 1).
Can someone explain where I went wrong, or what steps I missed?
Check the shape of the exo variable. If you are forecasting 13 steps, then you need to provide exog variables for each of those 13 steps. The error message is saying that you only provided exog variables for 11 steps. You can either provide a larger array to the exog argument, or you can change the forecast to be for 11 steps.
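For illustration, a hedged sketch of the first option (the name model_fit_btc is hypothetical and stands for a SARIMAX fit on the BTC close series): forecast the exogenous series the same number of steps ahead as the endogenous forecast, then pass exactly that many rows:

steps = 13
# Forecast the exogenous (BTC) series 'steps' periods ahead first
exo_future = model_fit_btc.forecast(steps=steps)
# Then supply exactly one exog row per endogenous forecast step
pred_uc = model_fit.forecast(steps=steps, exog=exo_future.to_numpy().reshape(steps, 1))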

Fetching dates of high and low prices for the week based on daily high/low prices

First of all, I will share the objective of running the Python code:
1. Getting daily high and low prices for a stock from Yahoo.
2. Converting the daily highs and lows to weekly highs/lows, monthly highs/lows, and yearly highs/lows.
3. Getting the exact dates of the weekly or monthly highs/lows from the daily dataframe.
4. Finally, after fetching the dates for the weekly (or monthly) highs and lows, arranging the data by what occurred first during the week: the high or the low. For example, for the week ending 12th December 2020, suppose the high of the week is 100 and the low is 97 (from step 2), and the high date and low date come from the daily dataframe (step 3). If the high happened on 9th December and the low on 12th December, the prices are arranged as 100 in row 1 and 97 in row 2, and this process repeats for the entire dataframe.
What I have been able to achieve: I have completed steps 1 and 2, and am struggling with step 3 as of now.
I have accomplished step 1 by:
import pandas as pd
import yfinance as yf
Ticker = '^NSEI'
f = yf.download(Ticker,period="max")
f = f.drop(['Adj Close'], axis=1)
f = f.drop(['Open'], axis=1)
f = f.drop(['Close'], axis=1)
f = f.drop(['Volume'], axis=1)
f.reset_index(inplace=True)
f.insert(0,'Ticker',Ticker)
Step 2 by:
fw = f.groupby(['Ticker', pd.Grouper(key='Date', freq='W')])\
     .agg(High=pd.NamedAgg(column='High', aggfunc='max'),
          Low=pd.NamedAgg(column='Low', aggfunc='min'))\
     .reset_index()
fm = f.groupby(['Ticker', pd.Grouper(key='Date', freq='M')])\
     .agg(High=pd.NamedAgg(column='High', aggfunc='max'),
          Low=pd.NamedAgg(column='Low', aggfunc='min'))\
     .reset_index()
fq = f.groupby(['Ticker', pd.Grouper(key='Date', freq='Q')])\
     .agg(High=pd.NamedAgg(column='High', aggfunc='max'),
          Low=pd.NamedAgg(column='Low', aggfunc='min'))\
     .reset_index()
fy = f.groupby(['Ticker', pd.Grouper(key='Date', freq='Y')])\
     .agg(High=pd.NamedAgg(column='High', aggfunc='max'),
          Low=pd.NamedAgg(column='Low', aggfunc='min'))\
     .reset_index()
I am struggling with step 3. I have used pd.merge, pd.join, and pd.concat, but I am unable to combine the weekly dataframe with the daily dataframe of highs and lows: the number of weekly records increases after the merge, and drop_duplicates didn't work properly even when keep='last' was specified.
So if you all can help me with steps 3 and 4, I would be grateful. Thanks
I solved the query which I posted above. Hope this helps others. Thanks
import pandas as pd
import yfinance as yf
import datetime as dt
import numpy as np
Ticker = '^NSEI'
df = yf.download(Ticker, period='max')
df= df.drop(['Open', 'Close', 'Adj Close', 'Volume'], axis = 1).reset_index()
# Daily data: 3238 rows, for reference
#Adding columns for weekly, monthly,6 month,Yearly,
df['WkEnd'] = df.Date.dt.to_period('W').apply(lambda r: r.start_time) + dt.timedelta(days=6)
df['MEnd'] = (df.Date.dt.to_period('M').apply(lambda r: r.end_time)).dt.date
df['6Mend'] = np.where(df.Date.dt.month <= 6,(df.Date.dt.year).astype(str)+'-1H',(df['Date'].dt.year).astype(str)+'-2H')
df['YEnd'] = (df.Date.dt.to_period('Y').apply(lambda r: r.end_time)).dt.date
# key variable for melting
d = {'Date':['Hidate', 'Lodate'], 'Price':['High','Low']}
#creating weekly neoformat
dw = df.groupby(['WkEnd']).agg({'High' : 'max','Low' : 'min' }).reset_index()
dw['Hidate'] = dw[['WkEnd','High']].merge(df,how = 'left').Date
dw['Lodate'] = dw[['WkEnd','Low']].merge(df,how = 'left').Date
dw = pd.lreshape(dw,d)
dw = dw.sort_values(by = ['Date']).reset_index()
dw = dw.drop(['index'], axis = 1)
#creating Monthly neoformat
dm = df.groupby(['MEnd']).agg({'High' : 'max','Low' : 'min' }).reset_index()
dm['Hidate'] = dm[['MEnd','High']].merge(df,how = 'left').Date
dm['Lodate'] = dm[['MEnd','Low']].merge(df,how = 'left').Date
dm = pd.lreshape(dm,d)
dm = dm.sort_values(by = ['Date']).reset_index()
dm = dm.drop(['index'], axis = 1)
#creating 6mth neoformat
d6m = df.groupby(['6Mend']).agg({'High' : 'max','Low' : 'min' }).reset_index()
d6m['Hidate'] = d6m[['6Mend','High']].merge(df,how = 'left').Date
d6m['Lodate'] = d6m[['6Mend','Low']].merge(df,how = 'left').Date
d6m = pd.lreshape(d6m,d)
d6m = d6m.sort_values(by = ['Date']).reset_index()
d6m = d6m.drop(['index'], axis = 1)
#creating Yearly neoformat
dy = df.groupby(['YEnd']).agg({'High' : 'max','Low' : 'min' }).reset_index()
dy['Hidate'] = dy[['YEnd','High']].merge(df,how = 'left').Date
dy['Lodate'] = dy[['YEnd','Low']].merge(df,how = 'left').Date
dy = pd.lreshape(dy,d)
dy = dy.sort_values(by = ['Date']).reset_index()
dy = dy.drop(['index'], axis = 1)
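As a side note for step 3, a hedged alternative to merging back on the extreme values (assuming each week's high and low values are unique within the week): idxmax/idxmin return the row positions of the extremes directly, so the dates can be looked up without any merge:

# Row positions of each week's high and low in the daily frame
hi_idx = df.groupby('WkEnd')['High'].idxmax()
lo_idx = df.groupby('WkEnd')['Low'].idxmin()
dw2 = df.groupby('WkEnd').agg(High=('High', 'max'), Low=('Low', 'min'))
dw2['Hidate'] = df.loc[hi_idx].set_index('WkEnd')['Date']
dw2['Lodate'] = df.loc[lo_idx].set_index('WkEnd')['Date']

This also sidesteps the duplicate rows that a merge produces when the same high value occurs twice in a week.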

How to scale a dataframe with a datetime field in it (as an index)?

I want to scale a dataframe, but doing so raises the error shown below.
My data:
df.head()
timestamp open high low close volume
0 2020-06-25 303.4700 305.26 301.2800 304.16 46340400
1 2020-06-24 309.8400 310.51 302.1000 304.09 123867696
2 2020-06-23 313.4801 314.50 311.6101 312.05 68066900
3 2020-06-22 307.9900 311.05 306.7500 310.62 74007212
4 2020-06-19 314.1700 314.38 306.5300 308.64 135211345
My code:
# Converting the index as date
from datetime import datetime
df.index = pd.to_datetime(df.index)

# Split data
split = len(df) - int(len(df) * 0.8)
df_train = df.iloc[split:]
df_test = df.iloc[:split]

# Normalize
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_train = df_train.values.reshape(-1,1) #df_train = scaler.fit_transform(df_train)
df_test = df_test.values.reshape(-1,1) #df_test = scaler.fit_transform(df_train)

# Train the Scaler with training data and smooth data
timestep = 21
for i in range(0, len(df), timestep):
    df_train = scaler.fit_transform(df_train[i:i+timestep,:])
    #train_data[di:di+smoothing_window_size,:] = scaler.transform(train_data[di:di+smoothing_window_size,:])

# You normalize the last bit of remaining data
df_test = scaler.fit_transform(df_test[i+timestep:,:])
#train_data[di+timestep:,:] = scaler.transform(train_data[di+timestep:,:])
The error:
2 timestep = 21
3 for i in range(0,len(df),timestep):
----> 4 df_train = scaler.fit_transform(df_train[i:i+timestep,:])
5 #train_data[di:di+smoothing_window_size,:] = scaler.transform(train_data[di:di+smoothing_window_size,:])
ValueError: could not convert string to float: '2020-05-28'
Help would be appreciated.
Simply iterate through the columns and scale each individually, like this:
for col in X.columns:
    X[col] = StandardScaler().fit_transform(X[col].to_numpy().reshape(-1, 1))
You can create your own scaler if you want to do something within an sklearn pipeline, like this:
class Scaler(StandardScaler):
    def __init__(self):
        super().__init__()
    def fit_transform(self, X, y=None):
        for col in X.columns:
            X[col] = StandardScaler().fit_transform(X[col].to_numpy().reshape(-1, 1))
        return X
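A side note: sklearn scalers already work column-by-column on a 2-D input, so the loop is only needed to keep the non-numeric timestamp out of the value matrix. A minimal sketch, assuming the timestamp is moved into the index first and the remaining columns are all numeric:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = df.set_index('timestamp')            # the datetime no longer sits in the values
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df),
                      index=df.index, columns=df.columns)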

Statsmodels OLS with rolling window problem

I would like to do a regression with a rolling window, but I got only one parameter back after the regression:
rolling_beta = sm.OLS(X2, X1, window_type='rolling', window=30).fit()
rolling_beta.params
The result:
X1 5.715089
dtype: float64
What could be the problem?
Thanks in advance, Roland
I think the problem is that the parameters window_type='rolling' and window=30 simply do not do anything. First I'll show you why, and at the end I'll provide a setup I've got lying around for linear regressions on rolling windows.
1. The problem with your function:
Since you haven't provided any sample data, here's a function that returns a dataframe of a desired size with some random numbers:
# Function to build synthetic data
import numpy as np
import pandas as pd
import statsmodels.api as sm
from collections import OrderedDict

def sample(rSeed, periodLength, colNames):
    np.random.seed(rSeed)
    date = pd.to_datetime("1st of Dec, 2018")
    cols = OrderedDict()
    for col in colNames:
        cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
    dates = date + pd.to_timedelta(np.arange(periodLength), 'D')
    df = pd.DataFrame(cols, index=dates)
    return df
Output:
X1 X2
2018-12-01 -1.085631 -1.294085
2018-12-02 0.997345 -1.038788
2018-12-03 0.282978 1.743712
2018-12-04 -1.506295 -0.798063
2018-12-05 -0.578600 0.029683
.
.
.
2019-01-17 0.412912 -1.363472
2019-01-18 0.978736 0.379401
2019-01-19 2.238143 -0.379176
Now, try:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='rolling', window=30).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And this matches the structure of your output: you're expecting an estimate for each of your sample windows, but instead you get a single estimate. I looked around for other examples using the same function, both online and in the statsmodels docs, but was unable to find any that actually worked. What I did find were a few discussions about how this functionality was deprecated a while ago. So I tested the same function with some bogus input for the parameters:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='amazing', window=3000000).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And as you can see, the estimates are the same, and no error messages are returned for the bogus input. So I suggest that you take a look at the function below. This is something I've put together to perform rolling regression estimates.
2. A function for regressions on rolling windows of a pandas dataframe
df = sample(rSeed = 123, colNames = ['X1', 'X2', 'X3'], periodLength = 50)
def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
    """
    RegressionRoll takes a dataframe, makes a subset of the data if you like,
    runs a series of regressions with a specified window length, and returns
    a dataframe with BETA or R^2 for each window split of the data.

    Parameters:
    ===========
    df: pandas dataframe
    subset: integer - has to be smaller than the size of the df
    dependent: string that specifies name of dependent variable
    independent: LIST of strings that specifies names of independent variables
    const: boolean - whether or not to include a constant term
    win: integer - window length of each model
    parameters: string that specifies which model parameters to return:
                BETA or R^2

    Example:
    ========
    RegressionRoll(df=df, subset = 50, dependent = 'X1', independent = ['X2'],
                   const = True, parameters = 'beta', win = 30)
    """
    # Data subset
    if subset != 0:
        df = df.tail(subset)
    else:
        df = df

    # Loop info
    end = df.shape[0]
    rng = np.arange(start = win, stop = end, step = 1)

    # Subset and store dataframes
    frames = {}
    n = 1
    for i in rng:
        df_temp = df.iloc[:i].tail(win)
        newname = 'df' + str(n)
        frames.update({newname: df_temp})
        n += 1

    # Analysis on subsets
    df_results = pd.DataFrame()
    for frame in frames:
        # Rolling data frames
        dfr = frames[frame]
        y = dependent
        x = independent
        if const == True:
            x = sm.add_constant(dfr[x])
            model = sm.OLS(dfr[y], x).fit()
        else:
            model = sm.OLS(dfr[y], dfr[x]).fit()
        if parameters == 'beta':
            theParams = model.params[0:]
            coefs = theParams.to_frame()
            df_temp = pd.DataFrame(coefs.T)
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
        if parameters == 'R2':
            theParams = model.rsquared
            df_temp = pd.DataFrame([theParams])
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_temp.columns = [', '.join(independent)]
        df_results = pd.concat([df_results, df_temp], axis = 0)

    return df_results
df_rolling = RegressionRoll(df=df, subset = 50, dependent = 'X1', independent = ['X2'],
                            const = True, parameters = 'beta', win = 30)
Output: a dataframe with beta estimates from the OLS of X1 on X2 for each 30-period window of the data.
const X2
Date
2018-12-30 0.044042 0.032680
2018-12-31 0.074839 -0.023294
2019-01-01 -0.063200 0.077215
.
.
.
2019-01-16 -0.075938 -0.215108
2019-01-17 -0.143226 -0.215524
2019-01-18 -0.129202 -0.170304
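For what it's worth, newer statsmodels releases (0.11+) ship a maintained replacement for the deprecated rolling arguments. A minimal sketch on the same synthetic df, assuming the constant-plus-X2 setup from the function above:

import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

exog = sm.add_constant(df[['X2']])
rres = RollingOLS(df['X1'], exog, window=30).fit()
print(rres.params.tail())   # one (const, X2) estimate row per 30-period window end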

python pandas to calculate rolling means

I am trying to calculate the Bollinger Bands for Facebook stock, but I found that rm_FB (the calculated rolling mean) is all NaN.
def get_rolling_mean(values, window):
    """Return rolling mean of given values, using specified window size."""
    t = pd.date_range('2016-02-01', '2016-06-06', freq='D')
    # print("Hey")
    # print(values)
    D = pd.Series(values, t)
    return D.rolling(window=window, center=False).mean()

def test_run():
    # Read data
    dates = pd.date_range('2016-02-01', '2016-06-06')
    symbols = ['FB']
    df = get_data(symbols, dates)

    # Compute Bollinger Bands
    # 1. Compute rolling mean
    rm_FB = get_rolling_mean(df['FB'], window=20)
    print("Hey")
    print(rm_FB)

if __name__ == "__main__":
    test_run()
I was confused by how you asked, so I manufactured the data and created a function that I hope helps.
import pandas as pd
import numpy as np
def bollinger_bands(s, k=2, n=20):
    """get_bollinger_bands DataFrame

    s is series of values
    k is multiple of standard deviations
    n is rolling window
    """
    b = pd.concat([s, s.rolling(n).agg([np.mean, np.std])], axis=1)
    b['upper'] = b['mean'] + b['std'] * k
    b['lower'] = b['mean'] - b['std'] * k
    return b.drop('std', axis=1)
Demonstration
np.random.seed([3,1415])
s = pd.Series(np.random.randn(100) / 100, name='price').add(1.001).cumprod()
bollinger_bands(s).plot()
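As a side note on the original question: the NaNs most likely come from pd.Series(values, t). When values is already a Series, the constructor reindexes it to t, so every non-trading day in the daily date_range becomes NaN, and any 20-day window containing one then yields a NaN mean. A hedged fix, assuming df['FB'] keeps its own trading-day index, is to roll on that index directly:

def get_rolling_mean(values, window):
    """Rolling mean computed on the series' own (trading-day) index."""
    return values.rolling(window=window).mean()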
