Calculating volatility manually vs built-in functions are not the same - python

Can someone help me understand where I'm going wrong? I don't know why the two volatility columns give different values...
This is an example of my code:
from math import sqrt
from numpy import around
from numpy.random import uniform
from pandas import DataFrame
from statistics import stdev
data = around(a=uniform(low=1.0, high=50.0, size=(500, 1)), decimals=3)
df = DataFrame(data=data, columns=['close'], dtype='float64')
df.loc[:, 'delta'] = df.loc[:, 'close'].pct_change().fillna(0).round(3)
volatility = []
for index in range(df.shape[0]):
    if index < 90:
        volatility.append(0)
    else:
        start = index - 90
        stop = index + 1
        volatility.append(stdev(df.loc[start:stop, 'delta']) * sqrt(252))
df.loc[:, 'volatility1'] = volatility
df.loc[:, 'volatility2'] = df.loc[:, 'delta'].rolling(window=90).std(ddof=0) * sqrt(252)
print(df)
close delta volatility1 volatility2
0 10.099 0.000 0.000000 NaN
1 26.331 1.607 0.000000 NaN
2 32.361 0.229 0.000000 NaN
3 2.068 -0.936 0.000000 NaN
4 36.241 16.525 0.000000 NaN
.. ... ... ... ...
495 48.015 -0.029 46.078037 46.132943
496 6.988 -0.854 46.036210 46.178820
497 23.331 2.339 46.003184 45.837245
498 25.551 0.095 45.608260 45.792188
499 46.248 0.810 45.793012 45.769787
[500 rows x 4 columns]
Thank you so much!

Three small changes are needed; I've marked them with inline comments. Use 89 rather than 90 because df.loc label slices include the endpoint (unlike most Python slicing), so start = index - 89 through stop = index already spans 90 rows. Use ddof=1 because statistics.stdev computes the sample standard deviation by default. This article talks about numpy std instead of stdev, but the theory of what ddof is doing is still the same.
Also, in the future, try changing size to something like 95. You don't need the other 405 rows when debugging, and it is useful to see the changeover from 0/NaN to actual volatility, which shows that you need 89, not 90.
The 0 vs NaN difference still exists. It comes from your appending 0 while rolling returns NaN by default for incomplete windows. I wasn't sure whether that was intentional, so I left it.
from math import sqrt
from numpy import around
from numpy.random import uniform
from pandas import DataFrame
from statistics import stdev
data = around(a=uniform(low=1.0, high=50.0, size=(500, 1)), decimals=3)
df = DataFrame(data=data, columns=['close'], dtype='float64')
df['delta'] = df['close'].pct_change().fillna(0).round(3)
volatility = []
for index in range(df.shape[0]):
    if index < 89: #change to 89
        volatility.append(0)
    else:
        start = index - 89 #change to 89
        stop = index #change to index (df.loc includes the endpoint)
        volatility.append(stdev(df.loc[start:stop, 'delta']) * sqrt(252))
df['volatility1'] = volatility
df['volatility2'] = df.loc[:, 'delta'].rolling(window=90).std(ddof=1) * sqrt(252) #change to ddof=1
print(df)
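As a quick sanity check of the two points above (inclusive df.loc endpoints and ddof=1), here is a minimal sketch on a small series. The window of 5, the seed, and the variable names are my own assumptions for illustration, not part of the original code.

```python
from statistics import stdev

import numpy as np
import pandas as pd

# Small synthetic series (assumed values, only for illustration)
rng = np.random.default_rng(0)
delta = pd.Series(rng.normal(0.0, 0.02, size=20))
window = 5

# Built-in rolling sample standard deviation (ddof=1 matches statistics.stdev)
rolled = delta.rolling(window=window).std(ddof=1)

# Manual version: .loc[a:b] includes b, so a = i - (window - 1) gives `window` rows
manual = pd.Series(
    [
        stdev(delta.loc[i - (window - 1):i].tolist()) if i >= window - 1 else np.nan
        for i in range(len(delta))
    ],
    index=delta.index,
)

matches = np.allclose(rolled.dropna().to_numpy(), manual.dropna().to_numpy())
same_nans = rolled.isna().equals(manual.isna())
print(matches, same_nans)
```

Appending NaN instead of 0 for the incomplete windows also makes the two columns line up in the first rows.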

Related

Errors attempting to use linearmodels.panel.PanelOLS entity effects (not time effects)

I have a Pandas DataFrame like (abridged):
        age gender  control        county
11877  67.0      F        0    AL-Calhoun
11552  60.0      F        0      AL-Coosa
11607  60.0      F        0  AL-Talladega
13821   NaN    NaN        1     AL-Mobile
11462  59.0      F        0       AL-Dale
I want to run a linear regression with fixed effects by county entity (not by time) to balance check my control and treatment groups for an experimental design, such that my dependent variable is membership in the treatment group (control = 1) or not (control = 0).
In order to do this, so far as I have seen I need to use linearmodels.panel.PanelOLS and set my entity field (county) as my index.
So far as I'm aware my model should look like this:
# set index on entity effects field:
to_model = to_model.set_index(["county"])
# implement fixed effects linear model
model = PanelOLS.from_formula("control ~ age + gender + EntityEffects", to_model)
When I try to do this, I get the below error:
ValueError: The index on the time dimension must be either numeric or date-like
I have seen a lot of implementations of such models online and they all seem to use a temporal effect, which is not relevant in my case. If I try to encode my county field using numerics, I get a different error.
# create a dict to map county values to numerics
county_map = dict(zip(to_model["county"].unique(), range(len(to_model.county.unique()))))
# create a numeric column as alternative to county
to_model["county_numeric"] = to_model["county"].map(county_map)
# set index on numeric entity effects field
to_model = to_model.set_index(["county_numeric"])
FactorEvaluationError: Unable to evaluate factor `control`. [KeyError: 'control']
How am I able to implement this model using the county as a unit fixed effect?
Assuming you have multiple entries for each county, then you could use the following. The key step is to use a groupby transform to create a distinct numeric index for each county which can be used as a fake time index.
import numpy as np
import pandas as pd
import string
import linearmodels as lm
# Generate random DF
rs = np.random.default_rng(1213892)
counties = rs.choice([c for c in string.ascii_lowercase], (1000, 3))
counties = np.array([["".join(c)] * 10 for c in counties]).ravel()
age = rs.integers(18, 65, (10 * 1000))
gender = rs.choice(["m", "f"], size=(10 * 1000))
control = rs.integers(0, 2, size=10 * 1000)
df = pd.DataFrame(
    {"counties": counties, "age": age, "gender": gender, "control": control}
)
# Construct a dummy numeric index for each county
numeric_index = df.groupby("counties").age.transform(lambda c: np.arange(len(c)))
df["numeric_index"] = numeric_index
df = df.set_index(["counties","numeric_index"])
# Take a look
df.head(15)
age gender control
counties numeric_index
qbt 0 51 m 1
1 36 m 0
2 28 f 1
3 28 m 0
4 47 m 0
5 19 m 1
6 32 m 1
7 54 m 0
8 36 m 1
9 52 m 0
nub 0 19 m 0
1 57 m 0
2 49 f 0
3 53 m 1
4 30 f 0
This just shows that the model can be estimated.
# Fit the model
# Note: Results are meaningless, just shows that this works
mod = lm.PanelOLS.from_formula("control ~ age + gender + EntityEffects", data=df)
mod.fit()
PanelOLS Estimation Summary
================================================================================
Dep. Variable: control R-squared: 0.0003
Estimator: PanelOLS R-squared (Between): 0.0005
No. Observations: 10000 R-squared (Within): 0.0003
Date: Thu, May 12 2022 R-squared (Overall): 0.0003
Time: 11:08:00 Log-likelihood -6768.3
Cov. Estimator: Unadjusted
F-statistic: 1.4248
Entities: 962 P-value 0.2406
Avg Obs: 10.395 Distribution: F(2,9036)
Min Obs: 10.0000
Max Obs: 30.000 F-statistic (robust): 2287.4
P-value 0.0000
Time periods: 30 Distribution: F(2,9036)
Avg Obs: 333.33
Min Obs: 2.0000
Max Obs: 962.00
Parameter Estimates
===============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
-------------------------------------------------------------------------------
age -0.0002 0.0004 -0.5142 0.6072 -0.0010 0.0006
gender[T.f] 0.5191 0.0176 29.559 0.0000 0.4847 0.5535
gender[T.m] 0.5021 0.0175 28.652 0.0000 0.4678 0.5365
===============================================================================
F-test for Poolability: 0.9633
P-value: 0.7768
Distribution: F(961,9036)
Included effects: Entity
PanelEffectsResults, id: 0x2246f38a9d0
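The dummy per-county index built above with groupby/transform can probably also be written with cumcount, which numbers the rows within each group. This is only a sketch on a tiny made-up frame; the column name obs and the sample values are assumptions for illustration.

```python
import pandas as pd

# Tiny made-up frame shaped like the question's data (values are assumed)
to_model = pd.DataFrame(
    {
        "county": ["AL-Calhoun", "AL-Coosa", "AL-Calhoun", "AL-Coosa", "AL-Calhoun"],
        "age": [67.0, 60.0, 60.0, 59.0, 45.0],
        "control": [0, 0, 1, 0, 1],
    }
)

# Number the observations within each county: 0, 1, 2, ...
to_model["obs"] = to_model.groupby("county").cumcount()

# (entity, fake time) MultiIndex, the layout a panel model expects
panel = to_model.set_index(["county", "obs"])
print(panel)
```

The resulting MultiIndex has the entity as the first level and a numeric second level, which is what the error message about the time dimension was asking for.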

Pandas: select the first value which is not negative anymore, return the row

For now my code looks like this:
df = pd.DataFrame()
max_exp = []
gammastar = []
for idx,rw in df_gamma_count.iterrows():
    exp = rw['Pr_B']*(rw['gamma_index']*float(test_spread)*(1+f)-(f+f))
    df = df.append({'exp': exp, 'gamma_perc': rw['gamma_index'], 'Pr_B':rw['Pr_B'], 'spread-test in %': test_spread }, ignore_index=True)
df = df.sort_values(by= ['exp'], ascending=True)
df
which gives me the following dataframe:
Pr_B exp gamma_perc spread-test in %
10077 0.000066 -2.078477e-08 1.544700 0.001090292473058004120128368625
10078 0.000066 -2.073422e-08 1.545400 0.001090292473058004120128368625
10079 0.000066 -2.071978e-08 1.545600 0.001090292473058004120128368625
10080 0.000066 -2.071256e-08 1.545700 0.001090292473058004120128368625
10081 0.000000 -0.000000e+00 1.545900 0.001090292473058004120128368625
10082 0.000000 -0.000000e+00 1.546200 0.001090292473058004120128368625
10083 0.000000 0.000000e+00 1.546300 0.001090292473058004120128368625
10084 0.000000 1 1.546600 0.001090292473058004120128368625
What I need now is to select the first value from the column exp which is not negative anymore. What I did for now is to sort the dataframe based on the column exp but after that I am a bit stuck and do not know where to go... any idea?
Try:
df.loc[df.exp.gt(0).idxmax()]
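This selects the first row where exp is strictly greater than 0 (use .ge(0) instead of .gt(0) to also accept 0). Note that if no value satisfies the condition, idxmax returns the first index label, so check df.exp.gt(0).any() first.
If you are trying to get the largest value in a series: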
df.exp.nlargest(1)
EDIT:
Use this to get your desired output:
df.loc[df.exp==np.where(all(i > 0 for i in df.exp.tolist()),min([n for n in df.exp.tolist() if n<=0]),max([n for n in df.exp.tolist() if n<=0]))]
print(df.loc[df.exp==np.where(all(i > 0 for i in df.exp.tolist()),min([n for n in df.exp.tolist() if n<=0]),max([n for n in df.exp.tolist() if n<=0]))].head(1))
Pr_B exp gamma_perc spread-test in %
4 0.0 0.0 1.5459 0.00109
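I would filter for numbers greater than or equal to 0 and take the first one, falling back to the maximum if there are none:
data = [-1,-2,-3, 0]
df = pd.DataFrame(data, columns=['exp'])
non_negative = df.exp[df.exp >= 0]
# use .empty rather than .any(): .any() is False when the only matches are 0
value = non_negative.iloc[0] if not non_negative.empty else df.exp.max()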
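To return the whole row rather than just the value, a boolean mask with idxmax plus an explicit guard is one option. This is a minimal sketch; the helper name first_non_negative_row and the sample data are hypothetical, and it assumes the frame is already in the order you want to scan and has a unique index.

```python
import pandas as pd


def first_non_negative_row(frame, column="exp"):
    """Return the first row whose `column` value is >= 0, or None if there is none."""
    mask = frame[column] >= 0
    if not mask.any():
        # idxmax on an all-False mask would silently return the first label
        return None
    return frame.loc[mask.idxmax()]


# Made-up data for illustration
example = pd.DataFrame(
    {"exp": [-3.0, -1.0, 0.0, 2.0], "gamma_perc": [1.0, 1.1, 1.2, 1.3]}
)
row = first_non_negative_row(example)
none_row = first_non_negative_row(pd.DataFrame({"exp": [-2.0, -1.0]}))
print(row)
```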

Pandas: duplicating dataframe entries while column higher or equal to 0

I have a dataframe containing clinical readings of hospital patients, for example a similar dataframe could look like this
heartrate pid time
0 67 151 0.0
1 75 151 1.2
2 78 151 2.5
3 99 186 0.0
In reality there are many more columns, but I will just keep those 3 to make the example more concise.
I would like to "expand" the dataset. In short, I would like to be able to give an argument n_times_back and another argument interval.
For each iteration, which corresponds to for i in range (n_times_back + 1), we do the following:
1. Create a new, unique pid [OLD ID | i]. (As long as the new pid is unique for each duplicated entry, the exact name isn't really important to me, so feel free to change this if it makes it easier.)
2. For every patient (pid), remove the rows whose time column is more than the final time of that patient - i * interval. For example, if i * interval = 2.0 and the times associated with one pid are [0, 0.5, 1.5, 2.8], the new times will be [0, 0.5], as final time - 2.0 = 0.8.
3. Iterate.
Since I realize that explaining this textually is a bit messy, here is an example.
With the dataset above, if we let n_times_back = 1 and interval=1 then we get
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 99 18600 0.0
For n_times_back = 2, the result would be
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 67 15102 0.0
6 99 18600 0.0
n_times_back = 3 and above would lead to the same result as n_times_back = 2, as no patient data goes below that point in time
I have written code for this.
def expand_df(df, n_times_back, interval):
    for curr_patient in df['pid'].unique():
        patient_data = df[df['pid'] == curr_patient]
        final_time = patient_data['time'].max()
        for i in range(n_times_back + 1):
            new_data = patient_data[patient_data['time'] <= final_time - i * interval]
            new_data['pid'] = patient_data['pid'].astype(str) + str(i).zfill(2)
            new_data['pid'] = new_data['pid'].astype(int)
            #check if there is any time index left, if not don't add useless entry to dataframe
            if(new_data['time'].count()>0):
                df = df.append(new_data)
        df = df[df['pid'] != curr_patient] # remove original patient data, now duplicate
    df.reset_index(inplace = True, drop = True)
    return df
As far as functionality goes, this code works as intended. However, it is very slow. I am working with a dataframe of 30'000 patients and the code has been running for over 2 hours now.
Is there a way to use pandas operations to speed this up? I have looked around but so far I haven't managed to reproduce this functionality with high level pandas functions
I ended up using a groupby function and breaking when no more times were available, as well as creating an "iteration" column that I merge with the "pid" column at the end.
def expand_df(group, n_times, interval):
    df = pd.DataFrame()
    final_time = group['time'].max()
    for i in range(n_times + 1):
        new_data = group[group['time'] <= final_time - i * interval]
        new_data['iteration'] = str(i).zfill(2)
        #check if there is any time index left, if not don't add useless entry to dataframe
        if(new_data['time'].count()>0):
            df = df.append(new_data)
        else:
            break
    return df
new_df = df.groupby('pid').apply(lambda x : expand_df(x, n_times_back, interval))
new_df = new_df.reset_index(drop=True)
new_df['pid'] = new_df['pid'].map(str) + new_df['iteration']
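DataFrame.append was removed in pandas 2.0, so on current versions the pieces have to be collected and joined with pd.concat. Below is a sketch of one way to do the same expansion with one boolean filter per iteration instead of one per patient. The function name expand_patients is hypothetical, and it assumes interval > 0 and integer pid values.

```python
import pandas as pd


def expand_patients(df, n_times_back, interval):
    """Duplicate each patient's rows, trimming i * interval off the end each time."""
    # Final time of each patient, aligned row by row with df
    final_time = df.groupby("pid")["time"].transform("max")
    pieces = []
    for i in range(n_times_back + 1):
        kept = df[df["time"] <= final_time - i * interval].copy()
        if kept.empty:
            # With interval > 0, later iterations would be empty as well
            break
        kept["pid"] = (kept["pid"].astype(str) + str(i).zfill(2)).astype(int)
        pieces.append(kept)
    out = pd.concat(pieces)
    return out.sort_values(["pid", "time"]).reset_index(drop=True)


# The example frame from the question
patients = pd.DataFrame(
    {
        "heartrate": [67, 75, 78, 99],
        "pid": [151, 151, 151, 186],
        "time": [0.0, 1.2, 2.5, 0.0],
    }
)
once = expand_patients(patients, n_times_back=1, interval=1)
twice = expand_patients(patients, n_times_back=2, interval=1)
print(twice)
```

The loop now runs n_times_back + 1 times in total rather than once per patient, which is where most of the time in the original version goes.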

Calculate max draw down with a vectorized solution in python

Maximum Drawdown is a common risk metric used in quantitative finance to assess the largest negative return that has been experienced.
Recently, I became impatient with the time to calculate max drawdown using my looped approach.
def max_dd_loop(returns):
    """returns is assumed to be a pandas series"""
    max_so_far = None
    start, end = None, None
    r = returns.add(1).cumprod()
    for r_start in r.index:
        for r_end in r.index:
            if r_start < r_end:
                # .loc replaces .ix, which was removed in pandas 1.0
                current = r.loc[r_end] / r.loc[r_start] - 1
                if (max_so_far is None) or (current < max_so_far):
                    max_so_far = current
                    start, end = r_start, r_end
    return max_so_far, start, end
I'm familiar with the common perception that a vectorized solution would be better.
The questions are:
can I vectorize this problem?
What does this solution look like?
How beneficial is it?
Edit
I modified Alexander's answer into the following function:
def max_dd(returns):
    """Assumes returns is a pandas Series"""
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    # idxmin/idxmax return index labels; in current pandas argmin/argmax return integer positions
    end = dd.idxmin()
    start = r.loc[:end].idxmax()
    return mdd, start, end
df_returns is assumed to be a dataframe of returns, where each column is a separate strategy/manager/security, and each row is a new date (e.g. monthly or daily).
cum_returns = (1 + df_returns).cumprod()
drawdown = 1 - cum_returns.div(cum_returns.cummax())
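As a small worked example of the two lines above, here is a sketch with two made-up return columns (the column names a and b and the numbers are assumptions). Taking the column-wise maximum of drawdown gives one max drawdown per strategy, expressed as a positive fraction.

```python
import pandas as pd

# Two made-up return series, one per column
df_returns = pd.DataFrame(
    {
        "a": [0.10, -0.20, 0.05, -0.10],
        "b": [0.01, 0.02, -0.05, 0.03],
    }
)

cum_returns = (1 + df_returns).cumprod()
drawdown = 1 - cum_returns.div(cum_returns.cummax())

# Largest drawdown seen in each column
max_drawdown = drawdown.max()
print(max_drawdown)
```

For column a the running peak is 1.1 and the trough is 0.8316, so the max drawdown is 1 - 0.8316 / 1.1 = 0.244.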
I had first suggested using .expanding() window but that's obviously not necessary with the .cumprod() and .cummax() built ins to calculate max drawdown up to any given point:
from datetime import date
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'returns': np.random.normal(0.001, 0.05, 1000)},
                  index=pd.date_range(start=date(2016, 1, 1), periods=1000, freq='D'))
df['cumulative_return'] = df.returns.add(1).cumprod().subtract(1)
df['max_drawdown'] = df.cumulative_return.add(1).div(df.cumulative_return.cummax().add(1)).subtract(1)
returns cumulative_return max_drawdown
2016-01-01 -0.014522 -0.014522 0.000000
2016-01-02 -0.022769 -0.036960 -0.022769
2016-01-03 0.026735 -0.011214 0.000000
2016-01-04 0.054129 0.042308 0.000000
2016-01-05 -0.017562 0.024004 -0.017562
2016-01-06 0.055254 0.080584 0.000000
2016-01-07 0.023135 0.105583 0.000000
2016-01-08 -0.072624 0.025291 -0.072624
2016-01-09 -0.055799 -0.031919 -0.124371
2016-01-10 0.129059 0.093020 -0.011363
2016-01-11 0.056123 0.154364 0.000000
2016-01-12 0.028213 0.186932 0.000000
2016-01-13 0.026914 0.218878 0.000000
2016-01-14 -0.009160 0.207713 -0.009160
2016-01-15 -0.017245 0.186886 -0.026247
2016-01-16 0.003357 0.190869 -0.022979
2016-01-17 -0.009284 0.179813 -0.032050
2016-01-18 -0.027361 0.147533 -0.058533
2016-01-19 -0.058118 0.080841 -0.113250
2016-01-20 -0.049893 0.026914 -0.157492
2016-01-21 -0.013382 0.013173 -0.168766
2016-01-22 -0.020350 -0.007445 -0.185681
2016-01-23 -0.085842 -0.092648 -0.255584
2016-01-24 0.022406 -0.072318 -0.238905
2016-01-25 0.044079 -0.031426 -0.205356
2016-01-26 0.045782 0.012917 -0.168976
2016-01-27 -0.018443 -0.005764 -0.184302
2016-01-28 0.021461 0.015573 -0.166797
2016-01-29 -0.062436 -0.047836 -0.218819
2016-01-30 -0.013274 -0.060475 -0.229189
... ... ... ...
2018-08-28 0.002124 0.559122 -0.478738
2018-08-29 -0.080303 0.433921 -0.520597
2018-08-30 -0.009798 0.419871 -0.525294
2018-08-31 -0.050365 0.348359 -0.549203
2018-09-01 0.080299 0.456631 -0.513004
2018-09-02 0.013601 0.476443 -0.506381
2018-09-03 -0.009678 0.462153 -0.511158
2018-09-04 -0.026805 0.422960 -0.524262
2018-09-05 0.040832 0.481062 -0.504836
2018-09-06 -0.035492 0.428496 -0.522411
2018-09-07 -0.011206 0.412489 -0.527762
2018-09-08 0.069765 0.511031 -0.494817
2018-09-09 0.049546 0.585896 -0.469787
2018-09-10 -0.060201 0.490423 -0.501707
2018-09-11 -0.018913 0.462235 -0.511131
2018-09-12 -0.094803 0.323611 -0.557477
2018-09-13 0.025736 0.357675 -0.546088
2018-09-14 -0.049468 0.290514 -0.568542
2018-09-15 0.018146 0.313932 -0.560713
2018-09-16 -0.034118 0.269104 -0.575700
2018-09-17 0.012191 0.284576 -0.570527
2018-09-18 -0.014888 0.265451 -0.576921
2018-09-19 0.041180 0.317562 -0.559499
2018-09-20 0.001988 0.320182 -0.558623
2018-09-21 -0.092268 0.198372 -0.599348
2018-09-22 -0.015386 0.179933 -0.605513
2018-09-23 -0.021231 0.154883 -0.613888
2018-09-24 -0.023536 0.127701 -0.622976
2018-09-25 0.030160 0.161712 -0.611605
2018-09-26 0.025528 0.191368 -0.601690
Given a time series of returns, we need to evaluate the aggregate return for every combination of starting point to ending point.
The first trick is to convert a time series of returns into a series of return indices. Given a series of return indices, I can calculate the return over any sub-period with the return index at the beginning ri_0 and at the end ri_1. The calculation is: ri_1 / ri_0 - 1.
The second trick is to produce a second series of inverses of return indices. If r is my series of return indices then 1 / r is my series of inverses.
The third trick is to take the matrix product of r * (1 / r).Transpose.
r is an n x 1 matrix. (1 / r).Transpose is a 1 x n matrix. The resulting product contains every combination of ri_j / ri_k. Just subtract 1 and I've actually got returns.
The fourth trick is to ensure that I'm constraining my denominator to represent periods prior to those being represented by the numerator.
Below is my vectorized function.
import numpy as np
import pandas as pd
def max_dd(returns):
    # make into a DataFrame so that it is a 2-dimensional
    # matrix such that I can perform an nx1 by 1xn matrix
    # multiplication and end up with an nxn matrix
    r = pd.DataFrame(returns).add(1).cumprod()
    # I copy r.T to ensure r's index is not the same
    # object as 1 / r.T's columns object
    x = r.dot(1 / r.T.copy()) - 1
    x.columns.name, x.index.name = 'start', 'end'
    # let's make sure we only calculate a return when start
    # is less than end.
    y = x.stack().reset_index()
    y = y[y.start < y.end]
    # my choice is to return the periods and the actual max
    # draw down
    z = y.set_index(['start', 'end']).iloc[:, 0]
    # idxmin returns the (start, end) label pair; in current
    # pandas argmin returns an integer position instead
    start, end = z.idxmin()
    return z.min(), start, end
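The four tricks can also be sketched directly in NumPy, which makes the n x n ratio matrix easier to see. This is an illustration on four made-up returns, not the function I timed; the variable names are assumptions.

```python
import numpy as np

# Made-up returns for illustration
returns = np.array([0.10, -0.20, 0.05, -0.10])

# Trick 1: return index
r = np.cumprod(1.0 + returns)

# Tricks 2 and 3: outer product of r with its inverses,
# so ratios[end, start] = r[end] / r[start] - 1
ratios = np.outer(r, 1.0 / r) - 1.0

# Trick 4: keep only pairs where start comes before end (strict lower triangle)
mask = np.tril(np.ones(ratios.shape, dtype=bool), k=-1)

worst = ratios[mask].min()
end_pos, start_pos = np.unravel_index(
    np.where(mask, ratios, np.inf).argmin(), ratios.shape
)
print(worst, start_pos, end_pos)
```

Note that the matrix needs O(n^2) memory, so this only works for series of moderate length.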
How does this perform?
For the vectorized solution, I ran 10 iterations over time series of lengths [10, 50, 100, 150, 200]. The times are below:
10: 0.032 seconds
50: 0.044 seconds
100: 0.055 seconds
150: 0.082 seconds
200: 0.047 seconds
The same test for the looped solution is below:
10: 0.153 seconds
50: 3.169 seconds
100: 12.355 seconds
150: 27.756 seconds
200: 49.726 seconds
Edit
Alexander's answer provides superior results. Same test using modified code
10: 0.000 seconds
50: 0.000 seconds
100: 0.004 seconds
150: 0.007 seconds
200: 0.008 seconds
I modified his code into the following function:
def max_dd(returns):
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    # idxmin/idxmax return index labels; in current pandas argmin/argmax return integer positions
    end = dd.idxmin()
    start = r.loc[:end].idxmax()
    return mdd, start, end
I recently had a similar issue, but instead of a global MDD, I was required to find the MDD for the interval after each peak. Also, in my case, I was supposed to take the MDD of each strategy alone and thus wasn't required to apply the cumprod. My vectorized implementation is also based on Investopedia.
def calc_MDD(networth):
    df = pd.Series(networth, name="nw").to_frame()
    max_peaks_idx = df.nw.expanding(min_periods=1).apply(lambda x: x.argmax()).fillna(0).astype(int)
    df['max_peaks_idx'] = pd.Series(max_peaks_idx).to_frame()
    nw_peaks = pd.Series(df.nw.iloc[max_peaks_idx.values].values, index=df.nw.index)
    df['dd'] = ((df.nw-nw_peaks)/nw_peaks)
    df['mdd'] = df.groupby('max_peaks_idx').dd.apply(lambda x: x.expanding(min_periods=1).apply(lambda y: y.min())).fillna(0)
    return df
Here is a sample after running this code:
nw max_peaks_idx dd mdd
0 10000.000 0 0.000000 0.000000
1 9696.948 0 -0.030305 -0.030305
2 9538.576 0 -0.046142 -0.046142
3 9303.953 0 -0.069605 -0.069605
4 9247.259 0 -0.075274 -0.075274
5 9421.519 0 -0.057848 -0.075274
6 9315.938 0 -0.068406 -0.075274
7 9235.775 0 -0.076423 -0.076423
8 9091.121 0 -0.090888 -0.090888
9 9033.532 0 -0.096647 -0.096647
10 8947.504 0 -0.105250 -0.105250
11 8841.551 0 -0.115845 -0.115845
And here is an image of the result applied to the complete dataset.
Although vectorized, this code is probably slower than the other, because for each time-series, there should be many peaks, and each one of these requires calculation, and so O(n_peaks*n_intervals).
PS: I could have eliminated the zero values in the dd and mdd columns, but I find it useful that these values help indicate when a new peak was observed in the time-series.

Python pandas multiplying 2 columns results in incorrect product

I have the following data:
vwapDataGMD.head()
Out[311]:
price size return logP priceVol
time
2013-01-02 08:00:03 29.280000 800 NaN 3.376905 78863044.800000
2013-01-02 08:00:05 29.308889 900 0.000986 3.377891 78940854.422222
2013-01-02 08:15:29 29.314348 230 0.000186 3.378077 78955557.578261
2013-01-02 08:24:21 29.400000 158 0.002918 3.380995 79186254.000000
2013-01-02 08:35:48 29.400000 100 0.000000 3.380995 79186254.000000
When I multiply the price and size columns, I get a priceVol column that is incorrect. For example, 29.28 * 800 should give priceVol = 23424, but I am getting a much larger number: priceVol = 78863044.800
My code was the following:
vwapDataGMD['priceVol'] = vwapDataGMD.price * vwapDataGMD.size
What am I doing wrong?
I think it's because you use vwapDataGMD.size to access the column. size is already a DataFrame attribute (the total number of elements, rows times columns), so pandas returns that number instead of the column. Use the following instead.
vwapDataGMD['priceVol'] = vwapDataGMD['price'] * vwapDataGMD['size']
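A tiny sketch of the difference, on a made-up two-row frame (the frame name trades and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up frame with a column that happens to be called "size"
trades = pd.DataFrame({"price": [29.28, 29.40], "size": [800, 100]})

# Attribute access hits DataFrame.size: number of rows times number of columns
attribute_value = trades.size

# Bracket access returns the actual column
column_values = trades["size"]

price_vol = trades["price"] * trades["size"]
print(attribute_value)
print(price_vol)
```

The same applies to any column whose name collides with an existing DataFrame attribute or method, so bracket access is the safer habit.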
