EWMA Covariance Matrix in Pandas - Optimization

I would like to calculate the EWMA covariance matrix from a DataFrame of stock price returns using Pandas, and have followed the methodology in PyPortfolioOpt.
I like the flexibility of Pandas objects and functions, but when the set of assets grows the function becomes very slow:
import pandas as pd
import numpy as np
def ewma_cov_pairwise_pd(x, y, alpha=0.06):
    # Keep only the observations where both series are present
    x = x.mask(y.isnull(), np.nan)
    y = y.mask(x.isnull(), np.nan)
    covariation = ((x - x.mean()) * (y - y.mean())).dropna()
    # Pass alpha through instead of hard-coding 0.06
    return covariation.ewm(alpha=alpha).mean().iloc[-1]
def ewma_cov_pd(rets, alpha=0.06):
    assets = rets.columns
    n = len(assets)
    cov = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            cov[i, j] = cov[j, i] = ewma_cov_pairwise_pd(
                rets.iloc[:, i], rets.iloc[:, j], alpha=alpha)
    return pd.DataFrame(cov, columns=assets, index=assets)
I would like to improve the speed of the code, ideally while still using Pandas, but the bottleneck is within the DataFrame.ewm() function, which uses 90% of the calculation time.
If using this function were a binding constraint, what would be the most efficient way of improving the speed at which the code runs? I was considering taking a brute-force approach and using concurrent.futures.ProcessPoolExecutor, but perhaps there is a better solution.
n = 100 # n is typically 2000
rets = pd.DataFrame(np.random.normal(0, 1., size=(n, n)))
cov_pd = ewma_cov_pd(rets)
The true time-series data can contain leading nulls and potentially missing values after that, although the latter is less likely.
Update I
A potential solution which leverages off the answer provided by Quang Hoang and produces the expected results in a far more reasonable time would be something similar to:
def ewma_cov_frame_qh(rets, alpha=0.06):
    # Weights decay from the most recent row (weight 1) back to the oldest row
    weights = (1-alpha) ** np.arange(len(rets))[::-1]
    normalized = (rets-rets.mean()).to_numpy()
    out = (weights * normalized.T) @ normalized / weights.sum()
    return pd.DataFrame(out, index=rets.columns, columns=rets.columns)
def ewma_cov_qh(rets, alpha=0.06):
    syms = rets.columns
    covar = pd.DataFrame(index=rets.columns, columns=rets.columns)
    # Dates on which the number of missing columns changes
    delta = rets.isnull().sum(axis=1).shift(1) - rets.isnull().sum(axis=1)
    dates = delta.loc[delta != 0].index.tolist()
    for date in dates:
        frame = rets.loc[rets.index >= date].dropna(axis=1, how='any')
        cov = ewma_cov_frame_qh(frame, alpha=alpha).reindex(index=syms, columns=syms)
        covar = covar.fillna(cov)
    return covar
cov_qh = ewma_cov_qh(rets)
This violates the requirement that the underlying covariance is calculated using the native Pandas/Numpy functions, and the calculation time will depend on the number of leading NaNs in the data set.
Update II
A potential improvement on the above which uses (a naive implementation of) multiprocessing and improves the calculation time by a further 42.5% on my machine is listed below:
from concurrent.futures import ProcessPoolExecutor, as_completed
from functools import partial
def ewma_cov_mp_worker(date, rets, alpha=0.06):
    syms = rets.columns
    frame = rets.loc[rets.index >= date].dropna(axis=1, how='any')
    return ewma_cov_frame_qh(frame, alpha=alpha).reindex(index=syms, columns=syms)
def ewma_cov_mp(rets, alpha=0.06):
    covar = pd.DataFrame(index=rets.columns, columns=rets.columns)
    delta = rets.isnull().sum(axis=1).shift(1) - rets.isnull().sum(axis=1)
    dates = delta.loc[delta != 0].index.tolist()
    func = partial(ewma_cov_mp_worker, rets=rets, alpha=alpha)
    covs = {}
    with ProcessPoolExecutor(max_workers=6) as executor:
        future_to_date = {executor.submit(func, date): date for date in dates}
        covs = {future_to_date[future]: future.result() for future in as_completed(future_to_date)}
    for date in dates:
        covar.fillna(covs[date], inplace=True)
    return covar
[I have not added this as an answer, as it does not address the original question and I am optimistic there is a better solution.]

Since you don't really care about the full ewm output, i.e. you only take the last value, we can try matrix multiplication:
def ewma(df, alpha=0.06):
    # alpha defaults to 0.06 so that the decay matches ewm(alpha=0.06) in the question
    weights = (1-alpha) ** np.arange(len(df))[::-1]
    # fillna with 0 here
    normalized = (df-df.mean()).fillna(0).to_numpy()
    out = (weights * normalized.T) @ normalized / weights.sum()
    return out
# verify (compare with a tolerance rather than exact float equality)
out = ewma(df)
print(np.isclose(out[0,1], ewma_cov_pairwise_pd(df[0], df[1])))
# True
And this took about 150 ms on my system with df.shape==(2000,2000) while your code refuses to run within minutes :-).
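The matrix product works because the last value of an adjusted exponentially weighted mean (the pandas default, adjust=True) is a plain weighted average with weights (1-alpha)**k, where k counts rows back from the most recent one. The following is a minimal NumPy-only sketch of that equivalence, assuming there are no missing values; the function names and the random test data are made up for illustration.

```python
import numpy as np

def ewma_cov_matrix(x, alpha=0.06):
    """Last-row EWMA covariance of the columns of x via one matrix product."""
    n_obs = x.shape[0]
    # Oldest row gets the smallest weight, newest row gets weight 1.
    weights = (1.0 - alpha) ** np.arange(n_obs)[::-1]
    centered = x - x.mean(axis=0)
    return (weights * centered.T) @ centered / weights.sum()

def ewma_cov_pair(x_i, x_j, alpha=0.06):
    """Same quantity for one pair, written as an explicit weighted average."""
    n_obs = len(x_i)
    weights = (1.0 - alpha) ** np.arange(n_obs)[::-1]
    products = (x_i - x_i.mean()) * (x_j - x_j.mean())
    return np.sum(weights * products) / np.sum(weights)

rng = np.random.default_rng(0)
rets = rng.normal(0.0, 1.0, size=(250, 4))  # hypothetical returns, no NaNs
cov = ewma_cov_matrix(rets)
pair = ewma_cov_pair(rets[:, 0], rets[:, 1])
print(np.isclose(cov[0, 1], pair))
```

Every entry of the matrix is the pairwise weighted average, so the n*(n+1)/2 separate ewm calls collapse into a single product.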

Related

Vectorization for computing variance of a vector split at different points

I have a 1-D array arr and I need to compute the variance of all possible contiguous subvectors that begin at position 0. It may be easier to understand with a for loop:
np.random.seed(1)
arr = np.random.normal(size=100)
res = []
for i in range(1, arr.size+1):
subvector = arr[:i]
var = np.var(subvector)
res.append(var)
Is there any way to compute res without the for loop?

Yes, since var = sum_squares / N - mean**2 and mean = sum / N, you can use cumsum to get the accumulated sums:
cumsum = np.cumsum(arr)
cummean = cumsum/(np.arange(len(arr)) + 1)
sq = np.cumsum(arr**2)
# correct the dof here
cumvar = sq/(np.arange(len(arr))+1) - cummean**2
np.allclose(res, cumvar)
# True

With pandas, you could use expanding:
import pandas as pd
pd.Series(arr).expanding().var(ddof=0).values
NB. one of the advantages is that you can benefit from the var parameters (by default ddof=1), and of course, you can run many other methods.

Calculating XIRR in Python

I need to calculate XIRR of financial investments made over a period of time. Is there any function to do this in numpy, pandas or plain python?
Reference: What is XIRR?
The accepted answer in the original question is not correct and can be improved.

Created a package for fast XIRR calculation, PyXIRR
It doesn't have external dependencies and works faster than any existing implementation.
from datetime import date
from pyxirr import xirr
dates = [date(2020, 1, 1), date(2021, 1, 1), date(2022, 1, 1)]
amounts = [-1000, 1000, 1000]
# feed columnar data
xirr(dates, amounts)
# feed tuples
xirr(zip(dates, amounts))
# feed DataFrame
import pandas as pd
xirr(pd.DataFrame({"dates": dates, "amounts": amounts}))

Here's an implementation taken from here.
import datetime
from scipy import optimize
def xnpv(rate,cashflows):
    chron_order = sorted(cashflows, key = lambda x: x[0])
    t0 = chron_order[0][0]
    return sum([cf/(1+rate)**((t-t0).days/365.0) for (t,cf) in chron_order])
def xirr(cashflows,guess=0.1):
    return optimize.newton(lambda r: xnpv(r,cashflows),guess)
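For readers without SciPy, the same root can be found by bisection instead of Newton's method. This is only a sketch: the bracket of -0.99 to 10.0, the tolerance, and the name xirr_bisect are assumptions chosen for illustration, and the cashflows reuse the three dates and amounts from the PyXIRR example.

```python
from datetime import date

def xnpv(rate, cashflows):
    """Net present value of (date, amount) pairs, discounted from the first date."""
    ordered = sorted(cashflows, key=lambda pair: pair[0])
    t0 = ordered[0][0]
    return sum(cf / (1.0 + rate) ** ((t - t0).days / 365.0) for t, cf in ordered)

def xirr_bisect(cashflows, low=-0.99, high=10.0, tol=1e-10, max_iter=200):
    """Find the rate where xnpv crosses zero by bisection (bracket is an assumption)."""
    f_low = xnpv(low, cashflows)
    f_high = xnpv(high, cashflows)
    if f_low * f_high > 0:
        raise ValueError("no sign change in the bracket")
    for _ in range(max_iter):
        mid = (low + high) / 2.0
        f_mid = xnpv(mid, cashflows)
        if abs(f_mid) < tol or (high - low) / 2.0 < tol:
            return mid
        if f_low * f_mid < 0:
            high = mid          # root lies in the lower half
        else:
            low, f_low = mid, f_mid  # root lies in the upper half
    return (low + high) / 2.0

cashflows = [(date(2020, 1, 1), -1000), (date(2021, 1, 1), 1000), (date(2022, 1, 1), 1000)]
rate = xirr_bisect(cashflows)
print(rate)
```

Bisection is slower than Newton's method but cannot diverge once the bracket contains a sign change.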

This implementation calculates the time delta once and then vectorizes the NPV calculation. It should run much faster than @pyCthon's solution for larger datasets. The input is a pandas series of cashflows with dates for the index.
Code
import pandas as pd
import numpy as np
from scipy import optimize
def xirr2(valuesPerDate):
    """ Calculate the irregular rate of return.
    valuesPerDate is a pandas series of cashflows with index of dates.
    """
    # Clean values
    valuesPerDateCleaned = valuesPerDate[valuesPerDate != 0]
    # Check for sign change
    if valuesPerDateCleaned.min() * valuesPerDateCleaned.max() >= 0:
        return np.nan
    # Set index to time delta in years
    valuesPerDateCleaned.index = (valuesPerDateCleaned.index - valuesPerDateCleaned.index.min()).days / 365.0
    result = np.nan
    try:
        result = optimize.newton(lambda r: (valuesPerDateCleaned / ((1 + r) ** valuesPerDateCleaned.index)).sum(), x0=0, rtol=1e-4)
    except (RuntimeError, OverflowError):
        result = optimize.brentq(lambda r: (valuesPerDateCleaned / ((1 + r) ** valuesPerDateCleaned.index)).sum(), a=-0.999999999999999, b=100, maxiter=10**4)
    if not isinstance(result, complex):
        return result
    else:
        return np.nan
Tests
valuesPerDate = pd.Series(dtype=float)
for d in pd.date_range(start='1990-01-01', end='2019-12-31', freq='M'):
    valuesPerDate[d] = 10*np.random.uniform(-0.5,1)
# Overwrite the first cashflow by position, not by label
valuesPerDate.iloc[0] = -100
print(xirr2(valuesPerDate))

Exponential Moving Average by time interval [duplicate]

I have a range of dates and a measurement on each of those dates. I'd like to calculate an exponential moving average for each of the dates. Does anybody know how to do this?
I'm new to python. It doesn't appear that averages are built into the standard python library, which strikes me as a little odd. Maybe I'm not looking in the right place.
So, given the following code, how could I calculate the moving weighted average of IQ points for calendar dates?
from datetime import date
days = [date(2008,1,1), date(2008,1,2), date(2008,1,7)]
IQ = [110, 105, 90]
(there's probably a better way to structure the data, any advice would be appreciated)

EDIT:
It seems that mov_average_expw() function from scikits.timeseries.lib.moving_funcs submodule from SciKits (add-on toolkits that complement SciPy) better suits the wording of your question.
To calculate an exponential smoothing of your data with a smoothing factor alpha (it is (1 - alpha) in Wikipedia's terms):
>>> alpha = 0.5
>>> assert 0 < alpha <= 1.0
>>> today = max(days)
>>> av = sum(alpha**(today - day).days * iq for day, iq in zip(days, IQ))
>>> av
95.0
The above is not pretty, so let's refactor it a bit:
from collections import namedtuple
from operator import itemgetter
def smooth(iq_data, alpha=1, today=None):
    """Perform exponential smoothing with factor `alpha`.
    Time period is a day.
    Each time period the value of `iq` drops `alpha` times.
    The most recent data is the most valuable one.
    """
    assert 0 < alpha <= 1
    if alpha == 1: # no smoothing
        return sum(map(itemgetter(1), iq_data))
    if today is None:
        today = max(map(itemgetter(0), iq_data))
    return sum(alpha**((today - date).days) * iq for date, iq in iq_data)
IQData = namedtuple("IQData", "date iq")
if __name__ == "__main__":
    from datetime import date
    days = [date(2008,1,1), date(2008,1,2), date(2008,1,7)]
    IQ = [110, 105, 90]
    iqdata = list(map(IQData, days, IQ))
    print("\n".join(map(str, iqdata)))
    print(smooth(iqdata, alpha=0.5))
Example:
$ python26 smooth.py
IQData(date=datetime.date(2008, 1, 1), iq=110)
IQData(date=datetime.date(2008, 1, 2), iq=105)
IQData(date=datetime.date(2008, 1, 7), iq=90)
95.0

I always calculate EMAs with Pandas. Here is an example of how to do it:
import pandas as pd
def ema(values, period):
    # pd.ewma() no longer exists in current pandas; Series.ewm(...).mean() replaces it
    values = pd.Series(values)
    return values.ewm(span=period).mean().iloc[-1]
values = [9, 5, 10, 16, 5]
period = 5
print(ema(values, period))
More infos about Pandas EWMA:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.ewma.html

I did a bit of googling and I found the following sample code (http://osdir.com/ml/python.matplotlib.general/2005-04/msg00044.html):
import numpy as np
def ema(s, n):
    """
    returns an n period exponential moving average for
    the time series s
    s is a list ordered from oldest (index 0) to most
    recent (index -1)
    n is an integer
    returns a numeric array of the exponential
    moving average
    """
    s = np.array(s)
    ema = []
    j = 1
    #get n sma first and calculate the next n period ema
    sma = sum(s[:n]) / n
    multiplier = 2 / float(1 + n)
    ema.append(sma)
    #EMA(current) = ( (Price(current) - EMA(prev) ) x Multiplier) + EMA(prev)
    ema.append(( (s[n] - sma) * multiplier) + sma)
    #now calculate the rest of the values
    for i in s[n+1:]:
        tmp = ( (i - ema[j]) * multiplier) + ema[j]
        j = j + 1
        ema.append(tmp)
    return ema

You can also use the SciPy filter method because the EMA is an IIR filter. This will have the benefit of being approximately 64 times faster as measured on my system using timeit on large data sets when compared to the enumerate() approach.
import numpy as np
from scipy.signal import lfilter
x = np.random.normal(size=1234)
alpha = .1 # smoothing coefficient
zi = [x[0]] # seed the filter state with first value
# filter can process blocks of continuous data if <zi> is maintained
y, zi = lfilter([1.-alpha], [1., -alpha], x, zi=zi)

I don't know Python, but for the averaging part, do you mean an exponentially decaying low-pass filter of the form
y_new = y_old + (input - y_old)*alpha
where alpha = dt/tau, dt = the timestep of the filter, tau = the time constant of the filter? (the variable-timestep form of this is as follows, just clip dt/tau to not be more than 1.0)
y_new = y_old + (input - y_old)*dt/tau
If you want to filter something like a date, make sure you convert to a floating-point quantity like # of seconds since Jan 1 1970.
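The variable-timestep form can be sketched in Python for the dated IQ measurements from the question. The function name, the tau_days parameter, and the choice of a 10-day time constant are assumptions made for illustration; the gain dt/tau is clipped to 1.0 as described above.

```python
from datetime import date

def time_weighted_ema(samples, tau_days):
    """Exponentially decaying average over irregularly spaced (date, value) pairs.

    tau_days is the filter time constant in days (a hypothetical parameter name).
    """
    ordered = sorted(samples, key=lambda pair: pair[0])
    prev_day, y = ordered[0]
    smoothed = [(prev_day, float(y))]
    for day, value in ordered[1:]:
        dt = (day - prev_day).days
        gain = min(dt / tau_days, 1.0)  # clip dt/tau so the gain never exceeds 1
        y = y + (value - y) * gain
        smoothed.append((day, y))
        prev_day = day
    return smoothed

days = [date(2008, 1, 1), date(2008, 1, 2), date(2008, 1, 7)]
IQ = [110, 105, 90]
result = time_weighted_ema(zip(days, IQ), tau_days=10.0)
for day, value in result:
    print(day, value)
```

A longer gap between measurements gives the new value more influence, which is the point of scaling the gain by dt.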

My python is a little bit rusty (anyone can feel free to edit this code to make corrections, if I've messed up the syntax somehow), but here goes....
def movingAverageExponential(values, alpha, epsilon = 0):
    if not 0 < alpha < 1:
        raise ValueError("out of range, alpha='%s'" % alpha)
    if not 0 <= epsilon < alpha:
        raise ValueError("out of range, epsilon='%s'" % epsilon)
    result = [None] * len(values)
    for i in range(len(result)):
        currentWeight = 1.0
        numerator = 0
        denominator = 0
        for value in values[i::-1]:
            numerator += value * currentWeight
            denominator += currentWeight
            currentWeight *= alpha
            if currentWeight < epsilon:
                break
        result[i] = numerator / denominator
    return result
This function moves backward, from the end of the list to the beginning, calculating the exponential moving average for each value by working backward until the weight coefficient for an element is less than the given epsilon.
At the end of the function, it reverses the values before returning the list (so that they're in the correct order for the caller).
(SIDE NOTE: if I was using a language other than python, I'd create a full-size empty array first and then fill it backwards-order, so that I wouldn't have to reverse it at the end. But I don't think you can declare a big empty array in python. And in python lists, appending is much less expensive than prepending, which is why I built the list in reverse order. Please correct me if I'm wrong.)
The 'alpha' argument is the decay factor on each iteration. For example, if you used an alpha of 0.5, then today's moving average value would be composed of the following weighted values:
today: 1.0
yesterday: 0.5
2 days ago: 0.25
3 days ago: 0.125
...etc...
Of course, if you've got a huge array of values, the values from ten or fifteen days ago won't contribute very much to today's weighted average. The 'epsilon' argument lets you set a cutoff point, below which you will cease to care about old values (since their contribution to today's value will be insignificant).
You'd invoke the function something like this:
result = movingAverageExponential(values, 0.75, 0.0001)
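To get a feel for how alpha and epsilon interact, the sketch below counts how many past values still carry a weight of at least epsilon. The helper name terms_above_cutoff is made up for this illustration and is not part of the function above.

```python
def terms_above_cutoff(alpha, epsilon):
    """Count the weights 1, alpha, alpha**2, ... that are still >= epsilon."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in (0, 1)")
    if not 0 < epsilon <= 1:
        raise ValueError("epsilon must be in (0, 1]")
    count = 0
    weight = 1.0
    while weight >= epsilon:
        count += 1
        weight *= alpha
    return count

# Weights for today, yesterday, 2 days ago, 3 days ago with alpha = 0.5
weights = [0.5 ** k for k in range(4)]
print(weights)

# How many values contribute with alpha = 0.75 and epsilon = 0.0001
n_terms = terms_above_cutoff(0.75, 0.0001)
print(n_terms)
```

A smaller alpha or a larger epsilon shortens the window, which bounds the cost of the inner loop.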

In matplotlib.org examples (http://matplotlib.org/examples/pylab_examples/finance_work2.html) is provided one good example of Exponential Moving Average (EMA) function using numpy:
def moving_average(x, n, type):
    x = np.asarray(x)
    if type=='simple':
        weights = np.ones(n)
    else:
        weights = np.exp(np.linspace(-1., 0., n))
    weights /= weights.sum()
    a = np.convolve(x, weights, mode='full')[:len(x)]
    a[:n] = a[n]
    return a

I found the above code snippet by @earino pretty useful, but I needed something that could continuously smooth a stream of values, so I refactored it to this:
def exponential_moving_average(period=1000):
    """ Exponential moving average. Smooths the values sent in over the period. Send in values - at first it'll return a simple average, but as soon as it's gathered 'period' values, it'll start to use the Exponential Moving Average to smooth the values.
    period: int - how many values to smooth over (default=1000). """
    multiplier = 2 / float(1 + period)
    cum_temp = yield None  # We are being primed
    # Start by just returning the simple average until we have enough data.
    for i in range(1, period + 1):
        cum_temp += yield cum_temp / float(i)
    # Grab the simple average
    ema = cum_temp / period
    # and start calculating the exponentially smoothed average
    while True:
        ema = (((yield ema) - ema) * multiplier) + ema
and I use it like this:
def temp_monitor(pin):
    """ Read from the temperature monitor - and smooth the value out. The sensor is noisy, so we use exponential smoothing. """
    ema = exponential_moving_average()
    next(ema)  # Prime the generator
    while True:
        yield ema.send(val_to_temp(pin.read()))
(where pin.read() produces the next value I'd like to consume).

Maybe the shortest:
#Specify decay in terms of span
#data_series should be a DataFrame
ema=data_series.ewm(span=5, adjust=False).mean()
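With adjust=False, pandas documents the result as the plain recursion y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t], with alpha = 2 / (span + 1). The following dependency-free sketch spells that recursion out; the function name ema_recursive is made up for illustration, and the input values are borrowed from another answer on this page.

```python
def ema_recursive(values, span):
    """EMA with alpha = 2 / (span + 1), seeded with the first value."""
    alpha = 2.0 / (span + 1.0)
    out = []
    for i, x in enumerate(values):
        if i == 0:
            out.append(float(x))  # the first output is the first input
        else:
            out.append((1.0 - alpha) * out[-1] + alpha * x)
    return out

smoothed = ema_recursive([9, 5, 10, 16, 5], span=5)
print(smoothed)
```

Because each output depends only on the previous output and the new input, this form is also the one to use when values arrive one at a time.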

import pandas_ta as ta
data["EMA3"] = ta.ema(data["close"], length=3)
pandas_ta is a Technical Analysis Library: https://github.com/twopirllc/pandas-ta. The above code calculates the Exponential Moving Average (EMA) for a series. You can specify the lag value using 'length'. Specifically, the above code calculates the 3-day EMA.

Here is a simple sample I worked up based on http://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:moving_averages
Note that unlike in their spreadsheet, I don't calculate the SMA, and I don't wait to generate the EMA after 10 samples. This means my values differ slightly, but if you chart it, it follows exactly after 10 samples. During the first 10 samples, the EMA I calculate is appropriately smoothed.
def emaWeight(numSamples):
    return 2 / float(numSamples + 1)
def ema(close, prevEma, numSamples):
    return ((close-prevEma) * emaWeight(numSamples) ) + prevEma
samples = [
    22.27, 22.19, 22.08, 22.17, 22.18, 22.13, 22.23, 22.43, 22.24, 22.29,
    22.15, 22.39, 22.38, 22.61, 23.36, 24.05, 23.75, 23.83, 23.95, 23.63,
    23.82, 23.87, 23.65, 23.19, 23.10, 23.33, 22.68, 23.10, 22.40, 22.17,
]
emaCap = 10
e=samples[0]
for s in range(len(samples)):
    numSamples = emaCap if s > emaCap else s
    e = ema(samples[s], e, numSamples)
    print(e)

I'm a little late to the party here, but none of the solutions given were what I was looking for. Nice little challenge using recursion and the exact formula given in investopedia.
No numpy or pandas required.
prices = [{'i': 1, 'close': 24.5}, {'i': 2, 'close': 24.6}, {'i': 3, 'close': 24.8}, {'i': 4, 'close': 24.9},
{'i': 5, 'close': 25.6}, {'i': 6, 'close': 25.0}, {'i': 7, 'close': 24.7}]
def rec_calculate_ema(n):
    k = 2 / (n + 1)
    price = prices[n]['close']
    if n == 1:
        return price
    res = (price * k) + (rec_calculate_ema(n - 1) * (1 - k))
    return res
print(rec_calculate_ema(3))

A fast way (copy-pasted from here) is the following:
def ExpMovingAverage(values, window):
    """ Numpy implementation of EMA
    """
    weights = np.exp(np.linspace(-1., 0., window))
    weights /= weights.sum()
    a = np.convolve(values, weights, mode='full')[:len(values)]
    a[:window] = a[window]
    return a

I am using a list and a rate of decay as inputs. I hope this little function with just two lines may help you here, considering deep recursion is not stable in python.
def expma(aseries, ratio):
    return sum([ratio*aseries[-x-1]*((1-ratio)**x) for x in range(len(aseries))])

More simply, using pandas:
def EMA(tw):
    for x in tw:
        data["EMA{}".format(x)] = data['close'].ewm(span=x, adjust=False).mean()
EMA([10,50,100])

Papahaba's answer was almost what I was looking for (thanks!) but I needed to match initial conditions. Using an IIR filter with scipy.signal.lfilter is certainly the most efficient. Here's my redux:
Given a NumPy vector, x
import numpy as np
from scipy import signal
period = 12
b = np.array((1,), 'd')
a = np.array((period, 1-period), 'd')
zi = signal.lfilter_zi(b, a)
y, zi = signal.lfilter(b, a, x, zi=zi*x[0:1])
Get the N-point EMA (here, 12) returned in the vector y

Efficient way to implement simple filter with varying coefficients in Python/Numpy

I am looking for an efficient way to implement a simple filter with one coefficient that is time-varying and specified by a vector with the same length as the input signal.
The following is a simple implementation of the desired behavior:
def myfilter(signal, weights):
    output = np.empty_like(weights)
    val = signal[0]
    for i in range(len(signal)):
        val += weights[i]*(signal[i] - val)
        output[i] = val
    return output
weights = np.random.uniform(0, 0.1, (100,))
signal = np.linspace(1, 3, 100)
output = myfilter(signal, weights)
Is there a way to do this more efficiently with numpy or scipy?

You can trade in the overhead of the loop for a couple of additional ops:
import numpy as np
def myfilter(signal, weights):
    output = np.empty_like(weights)
    val = signal[0]
    for i in range(len(signal)):
        val += weights[i]*(signal[i] - val)
        output[i] = val
    return output
def vectorised(signal, weights):
    wp = np.r_[1, np.multiply.accumulate(1 - weights[1:])]
    sw = weights * signal
    sw[0] = signal[0]
    sws = np.add.accumulate(sw / wp)
    return wp * sws
weights = np.random.uniform(0, 0.1, (100,))
signal = np.linspace(1, 3, 100)
print(np.allclose(myfilter(signal, weights), vectorised(signal, weights)))
On my machine the vectorised version is several times faster. It uses a "closed form" solution of your recurrence equation.
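The closed form can be reconstructed from the loop as follows. This derivation is a reading of the code above rather than something stated by the answer's author; the symbols $s_i$ (signal), $w_i$ (weights), $v_i$ (output) and $P_i$ (the array wp) are introduced here for illustration.

```latex
% The loop starts from val = signal[0], so the first step leaves it unchanged:
v_0 = s_0, \qquad v_i = (1 - w_i)\, v_{i-1} + w_i s_i \quad (i \ge 1).

% Define the running product (wp in the code):
P_0 = 1, \qquad P_i = \prod_{k=1}^{i} (1 - w_k).

% Dividing the recurrence by P_i and using P_i = (1 - w_i) P_{i-1}:
\frac{v_i}{P_i} = \frac{v_{i-1}}{P_{i-1}} + \frac{w_i s_i}{P_i}.

% Summing from 1 to i gives the cumulative-sum form (wp * sws in the code):
v_i = P_i \left( s_0 + \sum_{k=1}^{i} \frac{w_k s_k}{P_k} \right).
```

Since $P_k$ shrinks geometrically, the terms $w_k s_k / P_k$ grow without bound for long inputs, so the cumulative sum is only usable for moderately sized arrays.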
Edit: For very long signal / weight (100,000 samples, say) this method doesn't work because of overflow. In that regime you can still save a bit (more than 50% on my machine) using the following trick, which has the added bonus that you needn't solve the recurrence formula, only invert it.
from scipy import linalg
def solver(signal, weights):
    rw = 1 / weights[1:]
    v = np.r_[1, rw, 1-rw, 0]
    v.shape = 2, -1
    return linalg.solve_banded((1, 0), v, signal)
This trick uses the fact that your recurrence is formally similar to a Gauss elimination on a matrix with only one nonvanishing subdiagonal. It piggybacks on a library function that specialises in doing precisely that.
Actually, quite proud of this one.

Improving performance of Cronbach Alpha code python numpy

I made some code for calculating Cronbach Alpha that works. But I am not too good using lambda functions. Is there a way to reduce the code and improve efficiency by using lambda instead of the svar() function and getting rid of some of the for loops by using numpy arrays?
import numpy as np
def svar(X):
    n = float(len(X))
    svar=(sum([(x-np.mean(X))**2 for x in X]) / n)* n/(n-1.)
    return svar
def CronbachAlpha(itemscores):
    itemvars = [svar(item) for item in itemscores]
    tscores = [0] * len(itemscores[0])
    for item in itemscores:
        for i in range(len(item)):
            tscores[i]+= item[i]
    nitems = len(itemscores)
    #print "total scores=", tscores, 'number of items=', nitems
    Calpha=nitems/(nitems-1.) * (1-sum(itemvars)/ svar(tscores))
    return Calpha
###########Test################
itemscores = [[ 4,14,3,3,23,4,52,3,33,3],
              [ 5,14,4,3,24,5,55,4,15,3]]
print("Cronbach alpha = ", CronbachAlpha(itemscores))

def CronbachAlpha(itemscores):
    # rows are items, columns are observations (as in the question)
    itemscores = np.asarray(itemscores)
    itemvars = itemscores.var(axis=1, ddof=1)
    tscores = itemscores.sum(axis=0)
    nitems = len(itemscores)
    return nitems / (nitems-1.) * (1 - itemvars.sum() / tscores.var(ddof=1))
NumPy has a variance function built in. Specifying ddof=1 uses a denominator of N-1, giving a sample variance. There's also a sum builtin.

As Julien Marrec mentioned I suggest the following refactoring of the CronbachAlpha:
def CronbachAlpha(itemscores):
    # cols are items, rows are observations
    itemscores = np.asarray(itemscores)
    itemvars = itemscores.var(axis=0, ddof=1)
    tscores = itemscores.sum(axis=1)
    # after np.asarray there is no .columns attribute, so count columns via shape
    nitems = itemscores.shape[1]
    return (nitems / (nitems-1)) * (1 - (itemvars.sum() / tscores.var(ddof=1)))

Same as the other answers, just a bit more Pythonic. X is a data matrix -- that is, the rows are samples, the columns are items. X may be a numpy array or pandas DataFrame.
def cronbach_alpha(X):
    num_items = X.shape[1]
    sum_of_item_variances = X.var(axis=0).sum()
    variance_of_sum_of_items = X.sum(axis=1).var()
    return num_items/(num_items - 1)*(1 - sum_of_item_variances/variance_of_sum_of_items)
(It's not necessary to specify ddof, as the term appears in the denominator and numerator, and cancels.)
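The cancellation can be checked numerically. The sketch below is a variant of the function above with an explicit ddof argument, added here only for the comparison; it uses the item scores from the question, transposed so that rows are samples and columns are items.

```python
import numpy as np

def cronbach_alpha(scores, ddof=1):
    """Cronbach's alpha for a (samples x items) array."""
    scores = np.asarray(scores, dtype=float)
    num_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=ddof).sum()
    total_variance = scores.sum(axis=1).var(ddof=ddof)
    return num_items / (num_items - 1) * (1 - item_variances / total_variance)

# Item scores from the question, transposed so that rows are samples
scores = np.array([[4, 14, 3, 3, 23, 4, 52, 3, 33, 3],
                   [5, 14, 4, 3, 24, 5, 55, 4, 15, 3]]).T
alpha_sample = cronbach_alpha(scores, ddof=1)
alpha_population = cronbach_alpha(scores, ddof=0)
print(np.isclose(alpha_sample, alpha_population))
```

Both variances are taken over the same number of samples, so the N/(N-ddof) factor appears in the numerator and the denominator of the ratio and drops out.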
