Rolling average pairwise correlation in Python

Rolling average pairwise correlation in Python - python

I have daily returns from three markets (GLD, SPY, and USO). My goal is to calculate the the average pairwise correlation from a correlation matrix on a rolling basis of 130 days.
My starting point was:
import numpy as np
import pandas as pd
import os as os
import pandas.io.data as web
import datetime as datetime
from pandas.io.data import DataReader
stocks = ['spy', 'gld', 'uso']
start = datetime.datetime(2010,1,1)
end = datetime.datetime(2016,1,1)
df = web.DataReader(stocks, 'yahoo', start, end)
adj_close_df = df['Adj Close']
returns = adj_close_df.pct_change(1).dropna()
returns = returns.dropna()
rollingcor = returns.rolling(130).corr()
This creates a panel of correlation matrices. However, extracting the lower(or upper) triangles, removing the diagonals and then calculating the average for each observation is where I've drawn a blank. Ideally I would like the output for each date to be in a Series where I can then index it by the dates.
Maybe I've started from the wrong place but any help would be appreciated.

To get the average pairwise correlation, you can find the sum of the correlation matrix, substract n (ones on the diagonal), divide by 2 (symmetry), and finally divide by n (average). I think this should do it:
>>> n = len(stocks)
>>> ((rollingcor.sum(skipna=0).sum(skipna=0) - n) / 2) / n
Date
2010-01-05 NaN
2010-01-06 NaN
2010-01-07 NaN
...
2015-12-29 0.164356
2015-12-30 0.168102
2015-12-31 0.166462
dtype: float64

You could use numpy's tril to access the lower triangle of the dataframe.
def tril_sum(df):
# -1 ensures we skip the diagonal
return np.tril(df.unstack().values, -1).sum()
Calculates the sum of the lower triangle of the matrix. Notice the unstack() in the middle of that. I'm expecting to have a multiindex series that I'll need to pivot to a dataframe.
Then apply it to your panel
n = len(stock)
avg_cor = rollingcor.dropna().to_frame().apply(tril_sum) / ((n ** 2 - n) / 2)
Looks like:
print avg_cor.head()
Date
2010-07-12 0.398973
2010-07-13 0.403664
2010-07-14 0.402483
2010-07-15 0.403252
2010-07-16 0.407769
dtype: float64
This answer skips the diagonals.

Related

Optimization function yields wrong results

im trying to replicate a certain code from yuxing Yan's python for finance.
I am at a road block because I am getting very high minimized figures(in this case stock weights, which ca be both +(long) and (-short) after optimization with fmin().
can anyone help me with a fresh pair of eyes. I have seen some suggestion about avoiding passing negative or complex figures to fmin() but I can't afford to as its vital to my code
#Lets import our modules
from scipy.optimize import fmin #to minimise our negative sharpe-ratio
import numpy as np#deals with numbers python
from datetime import datetime#handles date objects
import pandas_datareader.data as pdr #to read download equity data
import pandas as pd #for reading and accessing tables etc
import scipy as sp
from scipy.stats import norm
import scipy.stats as stats
from scipy.optimize import fminbound
assets=('AAPL',
'IBM',
'GOOG',
'BP',
'XOM',
'COST',
'GS')
#start and enddate to be downloaded
startdate='2016-01-01'
enddate='2016-01-31'
rf_rate=0.0003
n=len(assets)
#_______________________________________________
#This functions takes the assets,start and end dates and
#returns portfolio return
#__________________________________________________
def port_returns (assets,startdate,enddate):
#We use adjusted clsoing prices of sepcified dates of assets
#as we will only be interested in returns
data = pdr.get_data_yahoo(assets, start=startdate, end=enddate)['Adj Close']
#We calculate the percentage change of our returns
#using pct_change function in python
returns=data.pct_change()
return returns
def portfolio_variance(returns,weight):
#finding the correlation of our returns by
#dropping the nan values and transposing
correlation_coefficient = np.corrcoef(returns.dropna().T)
#standard deviation of our returns
std=np.std(returns,axis=0)
#initialising our variance
port_var = 0.0
#creating a nested loop to calculate our portfolio variance
#where the variance is w12σ12 + w22σ22 + 2w1w2(Cov1,2)
#and correlation coefficient is given by covaraince btn two assets divided by standard
#multiplication of standard deviation of both assets
for i in range(n):
for j in range(n):
#we calculate the variance by continuously summing up the varaince between two
#assets using i as base loop, multiplying by std and corrcoef
port_var += weight[i]*weight[j]*std[i]*std[j]*correlation_coefficient[i, j]
return port_var
def sharpe_ratio(returns,weights):
#call our variance function
variance=portfolio_variance(returns,weights)
avg_return=np.mean(returns,axis=0)
#turn our returns to an array
returns_array = np.array(avg_return)
#Our sharpe ratio uses expected return gotten from multiplying weights and return
# and standard deviation gotten by square rooting our variance
#https://en.wikipedia.org/wiki/Sharpe_ratio
return (np.dot(weights,returns_array) - rf_rate)/np.sqrt(variance)
def negate_sharpe_ratio(weights):
#returns=port_returns (assets,startdate,enddate)
#creating an array with our weights by
#summing our n-1 inserted and subtracting by 1 to make our last weight
weights_new=np.append(weights,1-sum(weights))
#returning a negative sharpe ratio
return -(sharpe_ratio(returns_data,weights_new))
returns_data=port_returns(assets,startdate,enddate)
# for n stocks, we could only choose n-1 weights
ones_weights_array= (np.ones(n-1, dtype=float) * 1.0 )/n
weight_1 = fmin(negate_sharpe_ratio,ones_weights_array)
final_weight = np.append(weight_1, 1 - sum(weight_1))
final_sharpe_ratio = sharpe_ratio(returns_data,final_weight)
print ('Optimal weights are ')
print (final_weight)
print ('final Sharpe ratio is ')
print(final_sharpe_ratio)

A few things are causing your code not to work as written
is assets the list of items in ticker?
shouldstartdate be set equal to begdate?
Your call to port_returns() is looking for both assets and startdate which are never defined.
Function sharpe_ratio() is looking for a variable called rf_rate which is never defined. I assume this is the risk-free rate and the value assigned to rf at the beginning of the script. So should rf be called rf_rate instead?
After changing rf to rf_rate, begdate to startdate, and setting assets = list(ticker), it appears that this will work as written

Apply a function to a numpy array

I have a numpy array with data from yahoo finance that i got like this:
!pip install yfinance
import yfinance
tickers = yfinance.Tickers('GCV22.CMX CLV22.NYM')
So for each symbol I have open low high close prices as well as volume, on a daily basis:
Open High Low Close Volume Dividends Stock Splits
Date
2021-09-20 1752.000000 1766.000000 1740.500000 1761.800049 3656 0 0
2021-09-21 1763.400024 1780.800049 1756.300049 1776.099976 11490 0 0
2021-09-22 1773.099976 1785.900024 1762.800049 1776.699951 6343 0 0
2021-09-23 1766.900024 1774.500000 1736.300049 1747.699951 10630 0 0
2021-09-24 1741.300049 1755.599976 1738.300049 1749.699951 10630 0 0
I found this function in a paper (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2422183) and I would like to apply it to my dataset, but I can't understand how to apply it:
def fitKCA(t,z,q,fwd=0):
'''
Inputs:
t: Iterable with time indices
z: Iterable with measurements
q: Scalar that multiplies the seed states covariance
fwd: number of steps to forecast (optional, default=0)
Output:
x[0]: smoothed state means of position velocity and acceleration
x[1]: smoothed state covar of position velocity and acceleration
Dependencies: numpy, pykalman
'''
#1) Set up matrices A,H and a seed for Q
h=(t[-1]-t[0])/t.shape[0]
A=np.array([[1,h,.5*h**2],
[0,1,h],
[0,0,1]])
Q=q*np.eye(A.shape[0])
#2) Apply the filter
kf=KalmanFilter(transition_matrices=A,transition_covariance=Q)
#3) EM estimates
kf=kf.em(z)
#4) Smooth
x_mean,x_covar=kf.smooth(z)
#5) Forecast
for fwd_ in range(fwd):
x_mean_,x_covar_=kf.filter_update(filtered_state_mean=x_mean[-1], \
filtered_state_covariance=x_covar[-1])
x_mean=np.append(x_mean,x_mean_.reshape(1,-1),axis=0)
x_covar_=np.expand_dims(x_covar_,axis=0)
x_covar=np.append(x_covar,x_covar_,axis=0)
#6) Std series
x_std=(x_covar[:,0,0]**.5).reshape(-1,1)
for i in range(1,x_covar.shape[1]):
x_std_=x_covar[:,i,i]**.5
x_std=np.append(x_std,x_std_.reshape(-1,1),axis=1)
return x_mean,x_std,x_covar
In the paper they say: Numpy array t conveys the index of
observations. Numpy array z passes the observations. Scalar q provides a seed value for
initializing the EM estimation of the states covariance. How can i call this function with my data? I understand t should be the index column of each symbol, that is the data column, the z is the close price for each symbol of my numpy array, and q a random seed, but i can't make it works

The function in the paper states that you need :
t: Iterable with time indices
z: Iterable with measurements
q: Scalar that multiplies the seed states covariance
here is how you would compute them :
import yfinance
from random import random
tickers = yfinance.Ticker('MSFT')
history = tickers.history()
# t is the timestamps indexed at 0 for each row
t = [h[0] for h in history.values]
# z is the measurement here choosing open price
z = history.values.Open
# q random seeds
q = [random() for _ in t]
# finally call the function
fitKCA(t,z, q)

Calculating RSI for recent days

I'm working on improving my algo bot and one thing that I have implemented absolutely awful is RSI. Since RSI is a lagging indicator I can't get recent data, the last date I get a value for is 8 days ago. I'm therefore looking to calculate it somehow by using previous values and looking for ideas on how to do so.
My data points:
[222.19000244140625, nan]
[222.19000244140625, nan]
[215.47000122070312, nan]
[212.25, nan]
[207.97000122070312, nan]
[206.3300018310547, nan]
[205.88999938964844, nan]
[208.36000061035156, nan]
[204.08999633789062, 10.720487433358727]
[197.00999450683594, 7.934105468501102]
[194.6699981689453, 7.224811311424375]
[190.66000366210938, 6.148330770309926]
[191.6300048828125, 9.861218420857213]
[189.13999938964844, 8.835726925023536]
[189.02000427246094, 8.785409465194874]
[187.02000427246094, 7.925663008903896]
[195.69000244140625, 37.989974096922204]
[196.9199981689453, 41.10776671337689]
[194.11000061035156, 36.33757785797855]
As you can see 10.720487433358727 is my most recent value but I'm sure bigger brains than mine can figure out a way to calculate it up until today.
Thanks for your help!

It is important to note that there are various ways of defining the RSI. It is commonly defined in at least two ways: using a simple moving average (SMA) as above, or using an exponential moving average (EMA). Here's a code snippet that calculates both definitions of RSI and plots them for comparison. I'm discarding the first row after taking the difference, since it is always NaN by definition.
import pandas
import pandas_datareader.data as web
import datetime
import matplotlib.pyplot as plt
# Window length for moving average
window_length = 14
# Dates
start = '2020-12-01'
end = '2021-01-27'
# Get data
data = web.DataReader('AAPL', 'yahoo', start, end)
# Get just the adjusted close
close = data['Adj Close']
# Get the difference in price from previous step
delta = close.diff()
# Get rid of the first row, which is NaN since it did not have a previous
# row to calculate the differences
delta = delta[1:]
# Make the positive gains (up) and negative gains (down) Series
up, down = delta.copy(), delta.copy()
up[up < 0] = 0
down[down > 0] = 0
# Calculate the EWMA
roll_up1 = up.ewm(span=window_length).mean()
roll_down1 = down.abs().ewm(span=window_length).mean()
# Calculate the RSI based on EWMA
RS1 = roll_up1 / roll_down1
RSI1 = 100.0 - (100.0 / (1.0 + RS1))
# Calculate the SMA
roll_up2 = up.rolling(window_length).mean()
roll_down2 = down.abs().rolling(window_length).mean()
# Calculate the RSI based on SMA
RS2 = roll_up2 / roll_down2
RSI2 = 100.0 - (100.0 / (1.0 + RS2))
# Compare graphically
plt.figure(figsize=(8, 6))
RSI1.plot()
RSI2.plot()
plt.legend(['RSI via EWMA', 'RSI via SMA'])
plt.show()

Nan values when using np.linspace() as input

I am trying to calculate the probability of transmission for an electron through a series of potential wells. When looping through energy values using np.linspace() I get a return of nan for any value under 15. I understand this for values of 0 and 15, since they return a value of zero in the denominator for the k and q values. If I simply call getT(5) for example, I get a real value. However when getT(5) gets called from the loop using np.linspace(0,30,2001) then it returns nan. Shouldnt it return either nan or a value in both cases?
import numpy as np
import matplotlib.pyplot as plt
def getT(Ein):
#constants
hbar=1.055e-34 #J-s
m=9.109e-31 #mass of electron kg
N=10 #number of cells
a=1e-10 #meters
b=2e-10 #meters
#convert energy and potential to Joules
conv_J=1.602e-19
E_eV=Ein
V_eV=15
E=conv_J*E_eV
V=conv_J*V_eV
#calculate values for k and q
k=(2*m*E/hbar**2)**.5
q=(2*m*(E-V)/hbar**2)**.5
#create M1, M2 Matrices
M1=np.matrix([[((q+k)/(2*q))*np.exp(1j*k*b),((q-k)/(2*q))*np.exp(-1j*k*b)], \
[((q-k)/(2*q))*np.exp(1j*k*b),((q+k)/(2*q))*np.exp(-1j*k*b)]])
M2=np.matrix([[((q+k)/(2*k))*np.exp(1j*q*a),((k-q)/(2*k))*np.exp(-1j*q*a)], \
[((k-q)/(2*k))*np.exp(1j*q*a),((q+k)/(2*k))*np.exp(-1j*q*a)]])
#calculate M_Cell
M_Cell=M1*M2
#calculate M for N cells
M=M_Cell**N
#get items in M_Cell
M11=M.item(0,0)
M12=M.item(0,1)
M21=M.item(1,0)
M22=M.item(1,1)
#calculate r and t values
r=-M21/M22
t=M11-M12*M21/M22
#calculate final T value
T=abs(t)**2
return Ein,T
#create empty array for data to plot
data=[]
#Calculate T for 500 values of E in between 0 and 30 eV
for i in np.linspace(0,30,2001):
data.append(getT(i))
data=np.transpose(data)
#generate plot
fig, (ax1)=plt.subplots(1)
ax1.set_xlim([0,30])
ax1.set_xlabel('Energy (eV)',fontsize=32)
ax1.set_ylabel('T',fontsize=32)
ax1.grid()
plt.tick_params(labelsize=32)
plt.plot(data[0],data[1],lw=6)
plt.draw()
plt.show()

I think the difference comes from the line
q=(2*m*(E-V)/hbar**2)**.5
When testing with single values between 0 and 15, you're basically taking the root of a negative number (because E-V is negative), which is irrational, for example:
(-2)**0.5
>> (8.659560562354934e-17+1.4142135623730951j)
But when using np.linspace, you take the root of a NumPy array with negative values, which results in nan (and a warning):
np.array(-2)**0.5
>> RuntimeWarning: invalid value encountered in power
>> nan

Python pandas most pythonic way to calculate rolling betas on a DataFrame [duplicate]

I have many (4000+) CSVs of stock data (Date, Open, High, Low, Close) which I import into individual Pandas dataframes to perform analysis. I am new to python and want to calculate a rolling 12month beta for each stock, I found a post to calculate rolling beta (Python pandas calculate rolling stock beta using rolling apply to groupby object in vectorized fashion) however when used in my code below takes over 2.5 hours! Considering I can run the exact same calculations in SQL tables in under 3 minutes this is too slow.
How can I improve the performance of my below code to match that of SQL? I understand Pandas/python has that capability. My current method loops over each row which I know slows performance but I am unaware of any aggregate way to perform a rolling window beta calculation on a dataframe.
Note: the first 2 steps of loading the CSVs into individual dataframes and calculating daily returns only takes ~20seconds. All my CSV dataframes are stored in the dictionary called 'FilesLoaded' with names such as 'XAO'.
Your help would be much appreciated!
Thank you :)
import pandas as pd, numpy as np
import datetime
import ntpath
pd.set_option('precision',10) #Set the Decimal Point precision to DISPLAY
start_time=datetime.datetime.now()
MarketIndex = 'XAO'
period = 250
MinBetaPeriod = period
# ***********************************************************************************************
# CALC RETURNS
# ***********************************************************************************************
for File in FilesLoaded:
FilesLoaded[File]['Return'] = FilesLoaded[File]['Close'].pct_change()
# ***********************************************************************************************
# CALC BETA
# ***********************************************************************************************
def calc_beta(df):
np_array = df.values
m = np_array[:,0] # market returns are column zero from numpy array
s = np_array[:,1] # stock returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
#Build Custom "Rolling_Apply" function
def rolling_apply(df, period, func, min_periods=None):
if min_periods is None:
min_periods = period
result = pd.Series(np.nan, index=df.index)
for i in range(1, len(df)+1):
sub_df = df.iloc[max(i-period, 0):i,:]
if len(sub_df) >= min_periods:
idx = sub_df.index[-1]
result[idx] = func(sub_df)
return result
#Create empty BETA dataframe with same index as RETURNS dataframe
df_join = pd.DataFrame(index=FilesLoaded[MarketIndex].index)
df_join['market'] = FilesLoaded[MarketIndex]['Return']
df_join['stock'] = np.nan
for File in FilesLoaded:
df_join['stock'].update(FilesLoaded[File]['Return'])
df_join = df_join.replace(np.inf, np.nan) #get rid of infinite values "inf" (SQL won't take "Inf")
df_join = df_join.replace(-np.inf, np.nan)#get rid of infinite values "inf" (SQL won't take "Inf")
df_join = df_join.fillna(0) #get rid of the NaNs in the return data
FilesLoaded[File]['Beta'] = rolling_apply(df_join[['market','stock']], period, calc_beta, min_periods = MinBetaPeriod)
# ***********************************************************************************************
# CLEAN-UP
# ***********************************************************************************************
print('Run-time: {0}'.format(datetime.datetime.now() - start_time))

Generate Random Stock Data
20 Years of Monthly Data for 4,000 Stocks
dates = pd.date_range('1995-12-31', periods=480, freq='M', name='Date')
stoks = pd.Index(['s{:04d}'.format(i) for i in range(4000)])
df = pd.DataFrame(np.random.rand(480, 4000), dates, stoks)
df.iloc[:5, :5]
Roll Function
Returns groupby object ready to apply custom functions
See Source
def roll(df, w):
# stack df.values w-times shifted once at each stack
roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
# roll_array is now a 3-D array and can be read into
# a pandas panel object
panel = pd.Panel(roll_array,
items=df.index[w-1:],
major_axis=df.columns,
minor_axis=pd.Index(range(w), name='roll'))
# convert to dataframe and pivot + groupby
# is now ready for any action normally performed
# on a groupby object
return panel.to_frame().unstack().T.groupby(level=0)
Beta Function
Use closed form solution of OLS regression
Assume column 0 is market
See Source
def beta(df):
# first column is the market
X = df.values[:, [0]]
# prepend a column of ones for the intercept
X = np.concatenate([np.ones_like(X), X], axis=1)
# matrix algebra
b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values[:, 1:])
return pd.Series(b[1], df.columns[1:], name='Beta')
Demonstration
rdf = roll(df, 12)
betas = rdf.apply(beta)
Timing
Validation
Compare calculations with OP
def calc_beta(df):
np_array = df.values
m = np_array[:,0] # market returns are column zero from numpy array
s = np_array[:,1] # stock returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
print(calc_beta(df.iloc[:12, :2]))
-0.311757542437
print(beta(df.iloc[:12, :2]))
s0001 -0.311758
Name: Beta, dtype: float64
Note the first cell
Is the same value as validated calculations above
betas = rdf.apply(beta)
betas.iloc[:5, :5]
Response to comment
Full working example with simulated multiple dataframes
num_sec_dfs = 4000
cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(480, 4), dates, cols) for i in range(num_sec_dfs)}
market = pd.Series(np.random.rand(480), dates, name='Market')
df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(1)
betas = roll(df.pct_change().dropna(), 12).apply(beta)
for c, col in betas.iteritems():
dfs[c]['Beta'] = col
dfs['s0001'].head(20)

Using a generator to improve memory efficiency
Simulated data
m, n = 480, 10000
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
stocks = pd.Index(['s{:04d}'.format(i) for i in range(n)])
df = pd.DataFrame(np.random.rand(m, n), dates, stocks)
market = pd.Series(np.random.rand(m), dates, name='Market')
df = pd.concat([df, market], axis=1)
Beta Calculation
def beta(df, market=None):
# If the market values are not passed,
# I'll assume they are located in a column
# named 'Market'. If not, this will fail.
if market is None:
market = df['Market']
df = df.drop('Market', axis=1)
X = market.values.reshape(-1, 1)
X = np.concatenate([np.ones_like(X), X], axis=1)
b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values)
return pd.Series(b[1], df.columns, name=df.index[-1])
roll function
This returns a generator and will be far more memory efficient
def roll(df, w):
for i in range(df.shape[0] - w + 1):
yield pd.DataFrame(df.values[i:i+w, :], df.index[i:i+w], df.columns)
Putting it all together
betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T
Validation
OP beta calc
def calc_beta(df):
np_array = df.values
m = np_array[:,0] # market returns are column zero from numpy array
s = np_array[:,1] # stock returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
Experiment setup
m, n = 12, 2
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(m, 4), dates, cols) for i in range(n)}
market = pd.Series(np.random.rand(m), dates, name='Market')
df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(1)
betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T
for c, col in betas.iteritems():
dfs[c]['Beta'] = col
dfs['s0000'].head(20)
calc_beta(df[['Market', 's0000']])
0.0020118230147777435
NOTE:
The calculations are the same

While efficient subdivision of the input data set into rolling windows is important to the optimization of the overall calculations, the performance of the beta calculation itself can also be significantly improved.
The following optimizes only the subdivision of the data set into rolling windows:
def numpy_betas(x_name, window, returns_data, intercept=True):
if intercept:
ones = numpy.ones(window)
def lstsq_beta(window_data):
x_data = numpy.vstack([window_data[x_name], ones]).T if intercept else window_data[[x_name]]
beta_arr, residuals, rank, s = numpy.linalg.lstsq(x_data, window_data)
return beta_arr[0]
indices = [int(x) for x in numpy.arange(0, returns_data.shape[0] - window + 1, 1)]
return DataFrame(
data=[lstsq_beta(returns_data.iloc[i:(i + window)]) for i in indices]
, columns=list(returns_data.columns)
, index=returns_data.index[window - 1::1]
)
The following also optimizes the beta calculation itself:
def custom_betas(x_name, window, returns_data):
window_inv = 1.0 / window
x_sum = returns_data[x_name].rolling(window, min_periods=window).sum()
y_sum = returns_data.rolling(window, min_periods=window).sum()
xy_sum = returns_data.mul(returns_data[x_name], axis=0).rolling(window, min_periods=window).sum()
xx_sum = numpy.square(returns_data[x_name]).rolling(window, min_periods=window).sum()
xy_cov = xy_sum - window_inv * y_sum.mul(x_sum, axis=0)
x_var = xx_sum - window_inv * numpy.square(x_sum)
betas = xy_cov.divide(x_var, axis=0)[window - 1:]
betas.columns.name = None
return betas
Comparing the performance of the two different calculations, you can see that as the window used in the beta calculation increases, the second method dramatically outperforms the first:
Comparing the performance to that of #piRSquared's implementation, the custom method takes roughly 350 millis to evaluate compared to over 2 seconds.

Further optimizing on #piRSquared's implementation for both speed and memory. the code is also simplified for clarity.
from numpy import nan, ndarray, ones_like, vstack, random
from numpy.lib.stride_tricks import as_strided
from numpy.linalg import pinv
from pandas import DataFrame, date_range
def calc_beta(s: ndarray, m: ndarray):
x = vstack((ones_like(m), m))
b = pinv(x.dot(x.T)).dot(x).dot(s)
return b[1]
def rolling_calc_beta(s_df: DataFrame, m_df: DataFrame, period: int):
result = ndarray(shape=s_df.shape, dtype=float)
l, w = s_df.shape
ls, ws = s_df.values.strides
result[0:period - 1, :] = nan
s_arr = as_strided(s_df.values, shape=(l - period + 1, period, w), strides=(ls, ls, ws))
m_arr = as_strided(m_df.values, shape=(l - period + 1, period), strides=(ls, ls))
for row in range(period, l):
result[row, :] = calc_beta(s_arr[row - period, :], m_arr[row - period])
return DataFrame(data=result, index=s_df.index, columns=s_df.columns)
if __name__ == '__main__':
num_sec_dfs, num_periods = 4000, 480
dates = date_range('1995-12-31', periods=num_periods, freq='M', name='Date')
stocks = DataFrame(data=random.rand(num_periods, num_sec_dfs), index=dates,
columns=['s{:04d}'.format(i) for i in
range(num_sec_dfs)]).pct_change()
market = DataFrame(data=random.rand(num_periods), index=dates, columns=
['Market']).pct_change()
betas = rolling_calc_beta(stocks, market, 12)
%timeit betas = rolling_calc_beta(stocks, market, 12)
335 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

HERE'S THE SIMPLEST AND FASTEST SOLUTION
The accepted answer was too slow for what I needed and the I didn't understand the math behind the solutions asserted as faster. They also gave different answers, though in fairness I probably just messed it up.
I don't think you need to make a custom rolling function to calculate beta with pandas 1.1.4 (or even since at least .19). The below code assumes the data is in the same format as the above problems--a pandas dataframe with a date index, percent returns of some periodicity for the stocks, and market values are located in a column named 'Market'.
If you don't have this format, I recommend joining the stock returns to the market returns to ensure the same index with:
# Use .pct_change() only if joining Close data
beta_data = stock_data.join(market_data), how = 'inner').pct_change().dropna()
After that, it's just covariance divided by variance.
ticker_covariance = beta_data.rolling(window).cov()
# Limit results to the stock (i.e. column name for the stock) vs. 'Market' covariance
ticker_covariance = ticker_covariance.loc[pd.IndexSlice[:, stock], 'Market'].dropna()
benchmark_variance = beta_data['Market'].rolling(window).var().dropna()
beta = ticker_covariance / benchmark_variance
NOTES: If you have a multi-index, you'll have to drop the non-date levels to use the rolling().apply() solution. I only tested this for one stock and one market. If you have multiple stocks, a modification to the ticker_covariance equation after .loc is probably needed. Last, if you want to calculate beta values for the periods before the full window (ex. stock_data begins 1 year ago, but you use 3yrs of data), then you can modify the above to and expanding (instead of rolling) window with the same calculation and then .combine_first() the two.

Created a simple python package finance-calculator based on numpy and pandas to calculate financial ratios including beta. I am using the simple formula (as per investopedia):
beta = covariance(returns, benchmark returns) / variance(benchmark returns)
Covariance and variance are directly calculated in pandas which makes it fast. Using the api in the package is also simple:
import finance_calculator as fc
beta = fc.get_beta(scheme_data, benchmark_data, tail=False)
which will give you a dataframe of date and beta or the last beta value if tail is true.

but these would be blockish when you require beta calculations across the dates(m) for multiple stocks(n) resulting (m x n) number of calculations.
Some relief could be taken by running each date or stock on multiple cores, but then you will end up having huge hardware.
The major time requirement for the solutions available is finding the variance and co-variance and also NaN should be avoided in (Index and stock) data for a correct calculation as per pandas==0.23.0.
Thus running again would result stupid move unless the calculations are cached.
numpy variance and co-variance version also happens to miss-calculate the beta if NaN are not dropped.
A Cython implementation is must for huge set of data.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Rolling average pairwise correlation in Python - python

Related

Optimization function yields wrong results

Apply a function to a numpy array

Calculating RSI for recent days

Nan values when using np.linspace() as input

Python pandas most pythonic way to calculate rolling betas on a DataFrame [duplicate]

Categories

Resources