Here's a sample dataset with observations from 4 different trips (there are 4 unique trip IDs):
trip_id time_interval speed
0 8a8449635c10cc4b8e7841e517f27e2652c57ea3 873.96 0.062410
1 8a8449635c10cc4b8e7841e517f27e2652c57ea3 11.46 0.000000
2 8a8449635c10cc4b8e7841e517f27e2652c57ea3 903.96 0.247515
3 8a8449635c10cc4b8e7841e517f27e2652c57ea3 882.48 0.121376
4 8a8449635c10cc4b8e7841e517f27e2652c57ea3 918.78 0.185405
5 8a8449635c10cc4b8e7841e517f27e2652c57ea3 885.96 0.122147
6 f7fd70a8c14e43d8be91ef180e297d7195bbe9b0 276.60 0.583178
7 84d14618dcb30c28520cb679e867593c1d29213e 903.48 0.193313
8 84d14618dcb30c28520cb679e867593c1d29213e 899.34 0.085377
9 84d14618dcb30c28520cb679e867593c1d29213e 893.46 0.092259
10 84d14618dcb30c28520cb679e867593c1d29213e 849.36 0.350341
11 3db35f9835db3fe550de194b55b3a90a6c1ecb97 10.86 0.000000
12 3db35f9835db3fe550de194b55b3a90a6c1ecb97 919.50 0.071119
I am trying to compute the acceleration of each unique trip from one point to another.
Example:
first acceleration value will be computed using rows 0 and 1 (0 initial value; 1 final value)
second acceleration value will be computed using rows 1 and 2 (1 initial value; 2 final value)
... and so on.
As I want to compute this for each individual trip based on trip_id, this is what I attempted:
def get_acceleration(dataset):
    ##### INITIALISATION VARIABLES #####
    # Empty string for the trip ID
    current_trip_id = ""
    # Copy of the dataframe
    dataset2 = dataset.copy()
    # Creating a new column for the acceleration between observations of the same trip
    # all rows have a default value of 0
    dataset2["acceleration"] = 0
    ##### LOOP #####
    for index,row in dataset.iterrows():
        # Checking if we are looking at the same trip
        # when looking at the same trip, the default values of zero are replaced
        # by the calculated trip characteristic
        if row["trip_id"] == current_trip_id:
            # Final speed and time
            final_speed = row["speed"]
            final_time = row["time_interval"]
            print(type(final_speed))
            # Computing the acceleration (delta_v/ delta_t)
            acceleration = (final_speed[1] - initial_speed[0])/(final_time[1] - initial_time[0])
            # Adding the output to the acceleration column
            dataset2.loc[index, "acceleration"] = acceleration
        ##### UPDATING THE LOOP #####
        current_trip_id = row["trip_id"]
        # Initial speed and time
        initial_speed = row["speed"]
        initial_time = row["time_interval"]
    return dataset2
However, I get the error:
<ipython-input-42-0255a952850b> in get_acceleration(dataset)
27
28 # Computing the acceleration (delta_v/ delta_t)
---> 29 acceleration = (final_speed[1] - initial_speed[0])/(final_time[1] - initial_time[0])
30
31 # Adding the output to the acceleration column
TypeError: 'float' object is not subscriptable
How could I fix this error and compute the acceleration?
UPDATE:
After using the answer below, to avoid division by zero just add an if/else statement:
delta_speed = final_speed - initial_speed
delta_time = final_time - initial_time
# Computing the acceleration (delta_v/ delta_t)
if delta_time != 0:
    acceleration = (delta_speed)/(delta_time)
else:
    acceleration = 0
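It works if you drop the [1] and [0] subscripts, because row["speed"] and row["time_interval"] are already plain floats: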
acceleration = (final_speed - initial_speed)/(final_time - initial_time)
                                     trip_id  time_interval     speed  acceleration
0   8a8449635c10cc4b8e7841e517f27e2652c57ea3         873.96  0.062410      0.000000
1   8a8449635c10cc4b8e7841e517f27e2652c57ea3          11.46  0.000000      0.000072
2   8a8449635c10cc4b8e7841e517f27e2652c57ea3         903.96  0.247515      0.000277
3   8a8449635c10cc4b8e7841e517f27e2652c57ea3         882.48  0.121376      0.005872
4   8a8449635c10cc4b8e7841e517f27e2652c57ea3         918.78  0.185405      0.001764
5   8a8449635c10cc4b8e7841e517f27e2652c57ea3         885.96  0.122147      0.001927
6   f7fd70a8c14e43d8be91ef180e297d7195bbe9b0         276.60  0.583178      0.000000
7   84d14618dcb30c28520cb679e867593c1d29213e         903.48  0.193313      0.000000
8   84d14618dcb30c28520cb679e867593c1d29213e         899.34  0.085377      0.026071
9   84d14618dcb30c28520cb679e867593c1d29213e         893.46  0.092259      0.001170
10  84d14618dcb30c28520cb679e867593c1d29213e         849.36  0.350341     -0.005852
11  3db35f9835db3fe550de194b55b3a90a6c1ecb97          10.86  0.000000      0.000000
12  3db35f9835db3fe550de194b55b3a90a6c1ecb97         919.50  0.071119      0.000078
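The same per-trip calculation can also be written without an explicit loop. This is only a minimal sketch, assuming the rows of each trip are already in the order in which they should be differenced. The function name add_acceleration and the small demo frame are hypothetical, and the handling of a zero time step mirrors the if/else from the update.

```python
import numpy as np
import pandas as pd


def add_acceleration(dataset):
    """Return a copy with a per-trip acceleration column (hypothetical helper)."""
    out = dataset.copy()
    grouped = out.groupby("trip_id", sort=False)
    # diff() is NaN on the first row of every trip, so trips never mix
    delta_speed = grouped["speed"].diff()
    delta_time = grouped["time_interval"].diff()
    # A zero time step would divide by zero; turn it into NaN first
    acceleration = delta_speed / delta_time.replace(0, np.nan)
    # First rows and zero time steps fall back to 0, as in the loop version
    out["acceleration"] = acceleration.fillna(0.0)
    return out


# Small demo frame built from the first rows of the sample data
demo = pd.DataFrame({
    "trip_id": ["a", "a", "a", "b"],
    "time_interval": [873.96, 11.46, 903.96, 276.60],
    "speed": [0.062410, 0.000000, 0.247515, 0.583178],
})
result = add_acceleration(demo)
print(result)
```

Unlike the loop, groupby collects all rows with the same trip_id even if they are not adjacent, so the two versions only agree when each trip occupies one contiguous block of rows.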
Related
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements, however this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. It is a somewhat complicated query, but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm = (d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .apply(max))
   .unstack()
)
produces
tm                        1      2      3
MotherID PregnancyID
0        0              NaN  200.0    NaN
1        1              NaN  315.0  350.0
2        2            180.0    NaN    NaN
Let's unpick this a bit. First we groupby on MotherID, PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we groupby 'tm' and apply max. For each sub-dataframe d we then obtain a Series which is tm:max(abdomCirc).
Then unstack() moves tm to the column names.
You may want to rename these columns later, but I did not bother.
Solution 2
Come to think of it you can simplify the above a bit:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns = 'gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
Similar idea, same output.
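One thing to double-check in the trimester arithmetic: (weeks + 13 - 1) // 13 maps week 0 to 0 and week 40 to 4, which differs from the 0-13 / 14-26 / 27-40 ranges in the question. A possible alternative is to spell the ranges out with pd.cut. The sketch below assumes integer week values and rebuilds the sample frame from the question; the variable names trimester, names and wide are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PregnancyID": [0, 0, 1, 1, 1, 2, 2, 2],
    "MotherID": [0, 0, 1, 1, 1, 2, 2, 2],
    "gestationalAgeInWeeks": [14, 21, 20, 25, 30, 8, 9, 18],
    "abdomCirc": [150, 200, 294, 315, 350, 170, 180, np.nan],
})

# Bin edges follow the ranges in the question: 0-13, 14-26, 27-40 weeks
# (integer weeks assumed). labels=False yields the bin numbers 0, 1, 2.
trimester = pd.cut(df["gestationalAgeInWeeks"], bins=[0, 13, 26, 40],
                   labels=False, include_lowest=True)

names = {0: "abdomCirc1st", 1: "abdomCirc2nd", 2: "abdomCirc3rd"}
wide = (df.assign(trimester=trimester)
          .groupby(["PregnancyID", "MotherID", "trimester"])["abdomCirc"]
          .max()
          .unstack("trimester")
          .reindex(columns=[0, 1, 2])   # keep all three trimesters, even if empty
          .rename(columns=names))
print(wide)
```

Weeks outside 0-40 fall into no bin and are dropped by the groupby, so they would need their own rule if they occur in the real data.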
There is a magic command called query. This should do your work for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (and not manually changing the values of your ID's: MotherID and PregnancyID, every time for each different group of rows), you have to combine it with groupby (as you did on your own)
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
I want to have an extra column with the maximum relative difference [-] of the row-values and the mean of these rows:
The df is filled with energy use data for several years.
The theoretical formula that should get me this is as follows:
df['max_rel_dif'] = MAX [ ABS(highest energy use – mean energy use), ABS(lowest energy use – mean energy use)] / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
tried code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [23,43,36,87,90,98],
"y_2010": [22631,27456,61200,87621,58951,24587],
"y_2011": [21954,29654,np.nan,86542,57486,25478],
"y_2012": [22314,28159,np.nan,87542,2000,np.nan],
"y_2013": [22032,28654,31895,88456,0,24896,],
"y_2014": [21843,2000,1600,86961,0,25461]})
print(df)
a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]
# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple, I just don't know how to get the maximum values
from the tuples and then divide them by the mean in the proper Pythonic way
I feel like the whole function can be
s=df.filter(like='y')
s.sub(s.mean(1),axis=0).abs().max(1)/s.mean(1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64
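The one-liner works because the largest absolute deviation from the row mean is always reached at either the row maximum or the row minimum. For comparison, here is a sketch that follows the formula from the question term by term, using np.maximum for the element-wise maximum of two Series instead of a tuple. The names years, upper and lower are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [23, 43, 36, 87, 90, 98],
                   "y_2010": [22631, 27456, 61200, 87621, 58951, 24587],
                   "y_2011": [21954, 29654, np.nan, 86542, 57486, 25478],
                   "y_2012": [22314, 28159, np.nan, 87542, 2000, np.nan],
                   "y_2013": [22032, 28654, 31895, 88456, 0, 24896],
                   "y_2014": [21843, 2000, 1600, 86961, 0, 25461]})

# Only the yearly columns take part, so ID does not distort max/min/mean
years = df.filter(like="y_")
mean = years.mean(axis=1)

upper = (years.max(axis=1) - mean).abs()   # ABS(highest energy use - mean)
lower = (years.min(axis=1) - mean).abs()   # ABS(lowest energy use - mean)

# np.maximum compares the two Series row by row
df["max_rel_dif"] = np.maximum(upper, lower) / mean
print(df)
```

Note that the attempted code in the question took max and min over the whole df, including the ID column; restricting the calculation to the yearly columns avoids that.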
I am attempting to interpolate a value based on a number's position in a different column. Take this column for instance:
Coupon Price
9.5 109.04
9.375 108.79
9.25 108.54
9.125 108.29
9 108.04
8.875 107.79
8.75 107.54
8.625 107.29
8.5 107.04
8.375 106.79
8.25 106.54
Let's say I have a number like 107. I want to be able to find 107's relative distance from both 107.04 and 106.79 to interpolate the value that has the same relative distance between 8.5 and 8.375, the coupon values at the same index. Is this possible? I can solve this in Excel using the FORECAST method, but want to know if it can be done in Python.
Welcome to Stack Overflow.
We need to make a custom function for this, unless there's a standard library function I'm unaware of, which is entirely possible. I'm going to make a function that allows you to enter a bond by price and it will get inserted into the dataframe with the appropriate coupon.
Assuming we are starting with a sorted dataframe.
print(df)
Coupon Price
0 9.500 109.04
1 9.375 108.79
2 9.250 108.54
3 9.125 108.29
4 9.000 108.04
5 8.875 107.79
6 8.750 107.54
7 8.625 107.29
8 8.500 107.04
9 8.375 106.79
10 8.250 106.54
I've inserted comments into the function.
import numpy as np

def add_bond(Price, df):
    # Add row (np.nan, since the np.NaN alias was removed in NumPy 2.0)
    df.loc[df.shape[0]] = [np.nan, Price]
    df = df.sort_values('Price', ascending=False).reset_index(drop=True)
    # Get index
    idx = df[df['Price'] == Price].head(1).index.tolist()[0]
    # Get the distance from Prices from previous row to next row
    span = abs(df.iloc[idx-1, 1] - df.iloc[idx +1, 1]).round(4)
    # Get the distance and direction from Price from previous row to new value
    terp = (df.iloc[idx, 1] - df.iloc[idx-1, 1]).round(4)
    # Find the movement from the previous row as a fraction of the span.
    moved = terp / span
    # Finally calculate the move from the previous for Coupon.
    df.iloc[idx, 0] = df.iloc[idx-1,0] + (abs(df.iloc[idx-1,0] - df.iloc[idx+1, 0]) * (moved))
    return df
A function to calculate the Coupon of a new bond using Price in the DataFrame.
# Add 107
df = add_bond(107, df)
print(df)
Coupon Price
0 9.500 109.04
1 9.375 108.79
2 9.250 108.54
3 9.125 108.29
4 9.000 108.04
5 8.875 107.79
6 8.750 107.54
7 8.625 107.29
8 8.500 107.04
9 8.480 107.00
10 8.375 106.79
11 8.250 106.54
Add one more.
# Add 107.9
df = add_bond(107.9, df)
print(df)
Coupon Price
0 9.500 109.04
1 9.375 108.79
2 9.250 108.54
3 9.125 108.29
4 9.000 108.04
5 8.930 107.90
6 8.875 107.79
7 8.750 107.54
8 8.625 107.29
9 8.500 107.04
10 8.480 107.00
11 8.375 106.79
12 8.250 106.54
If this answer meets your needs, please remember to select correct answer. Thanks.
Probably there's a function that does the work for you somewhere, but my advice is to program it yourself; it's not difficult at all and it's a nice programming exercise. Just find the slope in that segment and use the equation of a straight line:
(y-y0) = ((y1-y0)/(x1-x0))*(x-x0) -> y = ((y1-y0)/(x1-x0))*(x-x0) + y0
Where:
x -> Your given value (107)
x1 & x0 -> The values right above and below (107.04 & 106.79)
y1 & y0 -> The corresponding values to x1 & x0 (8.5 & 8.375)
y -> Your target value.
Just basic high-school maths ;-)
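As a cross-check on the straight-line formula, NumPy's np.interp performs the same piecewise-linear interpolation. The sketch below assumes the table from the question is held in two arrays; coupon_for_price is a hypothetical helper name. np.interp expects increasing x-coordinates, so the prices are sorted first.

```python
import numpy as np

coupons = np.array([9.5, 9.375, 9.25, 9.125, 9.0, 8.875,
                    8.75, 8.625, 8.5, 8.375, 8.25])
prices = np.array([109.04, 108.79, 108.54, 108.29, 108.04, 107.79,
                   107.54, 107.29, 107.04, 106.79, 106.54])

# np.interp needs increasing x-coordinates, so sort both arrays by price
order = np.argsort(prices)


def coupon_for_price(price):
    """Linearly interpolate the coupon for a given price (hypothetical helper)."""
    return float(np.interp(price, prices[order], coupons[order]))


print(coupon_for_price(107))
print(coupon_for_price(107.9))
```

For 107 and 107.9 this gives approximately 8.48 and 8.93, matching the tables in the other answer. Prices outside the tabulated range are clamped to the end values by default rather than extrapolated.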
Below is a sample dataframe which is similar to mine except the one I am working on has 200,000 data points.
import pandas as pd
import numpy as np
df=pd.DataFrame([
[10.07,5], [10.24,5], [12.85,5], [11.85,5],
[11.10,5], [14.56,5], [14.43,5], [14.85,5],
[14.95,5], [10.41,5], [15.20,5], [15.47,5],
[15.40,5], [15.31,5], [15.43,5], [15.65,5]
], columns=['speed','delta_t'])
df
speed delta_t
0 10.07 5
1 10.24 5
2 12.85 5
3 11.85 5
4 11.10 5
5 14.56 5
6 14.43 5
7 14.85 5
8 14.95 5
9 10.41 5
10 15.20 5
11 15.47 5
12 15.40 5
13 15.31 5
14 15.43 5
15 15.65 5
std_dev = df.iloc[0:3,0].std() # this will give 1.55
print(std_dev)
I have 2 columns, 'Speed' and 'Delta_T'. Delta_T is the difference in time between subsequent rows in my actual data (which has date and time). The operating speed keeps varying, and what I want to achieve is to extract all data points where the speed is nearly steady, say by filtering for a standard deviation of < 0.5 and Delta_T >= 15 min. For example, if we start with the first speed, the code should keep moving on to the next speeds and keep calculating the standard deviation; if it is less than 0.5 and the Delta_T sums up to 30 min or more, I should copy that data into a new dataframe.
So for this dataframe I will be left with index 5 to 8 and 10 to 15.
Is this possible? Could you please give me some suggestions on how to do it? Sorry, I am stuck. It seems too complicated to me.
Thank you.
Best Regards Arun
Let's use rolling, shift and std:
Calculate the rolling std for a window of 3, then find those stds less than 0.5 and use shift(-2) to get the values at the start of the window where std was less than 0.5. Using boolean indexing with | (or) we can get the entire steady state range.
df_std = df['speed'].rolling(3).std()
df_ss = df[(df_std < 0.5) | (df_std < 0.5).shift(-2)]
df_ss
Output:
speed delta_t
5 14.56 5
6 14.43 5
7 14.85 5
8 14.95 5
10 15.20 5
11 15.47 5
12 15.40 5
13 15.31 5
14 15.43 5
15 15.65 5
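The shift(-2) trick flags the first and last row of each steady window; the middle row is only picked up when a neighbouring window is also steady. A variant that marks every row of every steady window is sketched below. It is only an illustration under the same assumptions (window of 3, threshold 0.5), the names steady_end and in_steady_window are made up here, and it does not yet enforce the minimum duration on delta_t.

```python
import pandas as pd

speeds = [10.07, 10.24, 12.85, 11.85, 11.10, 14.56, 14.43, 14.85,
          14.95, 10.41, 15.20, 15.47, 15.40, 15.31, 15.43, 15.65]
df = pd.DataFrame({"speed": speeds, "delta_t": 5})

window = 3
threshold = 0.5

# True at the LAST row of every window whose std is below the threshold
steady_end = df["speed"].rolling(window).std() < threshold

# Spread each flag back over the whole window: a row is kept if the window
# ending at it, or at one of the next (window - 1) rows, is steady.
# Reversing the Series turns the usual backward-looking rolling window
# into a forward-looking one.
in_steady_window = (steady_end.astype(int)[::-1]
                    .rolling(window, min_periods=1).max()[::-1]
                    .astype(bool))

df_ss = df[in_steady_window]
print(df_ss)
```

On the sample data this keeps the same rows as the answer above, index 5 to 8 and 10 to 15.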
Maximum Drawdown is a common risk metric used in quantitative finance to assess the largest negative return that has been experienced.
Recently, I became impatient with the time to calculate max drawdown using my looped approach.
def max_dd_loop(returns):
    """returns is assumed to be a pandas series"""
    max_so_far = None
    start, end = None, None
    r = returns.add(1).cumprod()
    for r_start in r.index:
        for r_end in r.index:
            if r_start < r_end:
                # .ix was removed in pandas 1.0; .loc does the label lookup
                current = r.loc[r_end] / r.loc[r_start] - 1
                if (max_so_far is None) or (current < max_so_far):
                    max_so_far = current
                    start, end = r_start, r_end
    return max_so_far, start, end
I'm familiar with the common perception that a vectorized solution would be better.
The questions are:
can I vectorize this problem?
What does this solution look like?
How beneficial is it?
Edit
I modified Alexander's answer into the following function:
def max_dd(returns):
    """Assumes returns is a pandas Series"""
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    # idxmin/idxmax return index labels; argmin/argmax return positions in current pandas
    end = dd.idxmin()
    start = r.loc[:end].idxmax()
    return mdd, start, end
df_returns is assumed to be a dataframe of returns, where each column is a separate strategy/manager/security, and each row is a new date (e.g. monthly or daily).
cum_returns = (1 + df_returns).cumprod()
drawdown = 1 - cum_returns.div(cum_returns.cummax())
I had first suggested using .expanding() window but that's obviously not necessary with the .cumprod() and .cummax() built ins to calculate max drawdown up to any given point:
df = pd.DataFrame(data={'returns': np.random.normal(0.001, 0.05, 1000)},
index=pd.date_range(start=date(2016, 1, 1), periods=1000, freq='D'))
df['cumulative_return'] = df.returns.add(1).cumprod().subtract(1)
df['max_drawdown'] = df.cumulative_return.add(1).div(df.cumulative_return.cummax().add(1)).subtract(1)
returns cumulative_return max_drawdown
2016-01-01 -0.014522 -0.014522 0.000000
2016-01-02 -0.022769 -0.036960 -0.022769
2016-01-03 0.026735 -0.011214 0.000000
2016-01-04 0.054129 0.042308 0.000000
2016-01-05 -0.017562 0.024004 -0.017562
2016-01-06 0.055254 0.080584 0.000000
2016-01-07 0.023135 0.105583 0.000000
2016-01-08 -0.072624 0.025291 -0.072624
2016-01-09 -0.055799 -0.031919 -0.124371
2016-01-10 0.129059 0.093020 -0.011363
2016-01-11 0.056123 0.154364 0.000000
2016-01-12 0.028213 0.186932 0.000000
2016-01-13 0.026914 0.218878 0.000000
2016-01-14 -0.009160 0.207713 -0.009160
2016-01-15 -0.017245 0.186886 -0.026247
2016-01-16 0.003357 0.190869 -0.022979
2016-01-17 -0.009284 0.179813 -0.032050
2016-01-18 -0.027361 0.147533 -0.058533
2016-01-19 -0.058118 0.080841 -0.113250
2016-01-20 -0.049893 0.026914 -0.157492
2016-01-21 -0.013382 0.013173 -0.168766
2016-01-22 -0.020350 -0.007445 -0.185681
2016-01-23 -0.085842 -0.092648 -0.255584
2016-01-24 0.022406 -0.072318 -0.238905
2016-01-25 0.044079 -0.031426 -0.205356
2016-01-26 0.045782 0.012917 -0.168976
2016-01-27 -0.018443 -0.005764 -0.184302
2016-01-28 0.021461 0.015573 -0.166797
2016-01-29 -0.062436 -0.047836 -0.218819
2016-01-30 -0.013274 -0.060475 -0.229189
... ... ... ...
2018-08-28 0.002124 0.559122 -0.478738
2018-08-29 -0.080303 0.433921 -0.520597
2018-08-30 -0.009798 0.419871 -0.525294
2018-08-31 -0.050365 0.348359 -0.549203
2018-09-01 0.080299 0.456631 -0.513004
2018-09-02 0.013601 0.476443 -0.506381
2018-09-03 -0.009678 0.462153 -0.511158
2018-09-04 -0.026805 0.422960 -0.524262
2018-09-05 0.040832 0.481062 -0.504836
2018-09-06 -0.035492 0.428496 -0.522411
2018-09-07 -0.011206 0.412489 -0.527762
2018-09-08 0.069765 0.511031 -0.494817
2018-09-09 0.049546 0.585896 -0.469787
2018-09-10 -0.060201 0.490423 -0.501707
2018-09-11 -0.018913 0.462235 -0.511131
2018-09-12 -0.094803 0.323611 -0.557477
2018-09-13 0.025736 0.357675 -0.546088
2018-09-14 -0.049468 0.290514 -0.568542
2018-09-15 0.018146 0.313932 -0.560713
2018-09-16 -0.034118 0.269104 -0.575700
2018-09-17 0.012191 0.284576 -0.570527
2018-09-18 -0.014888 0.265451 -0.576921
2018-09-19 0.041180 0.317562 -0.559499
2018-09-20 0.001988 0.320182 -0.558623
2018-09-21 -0.092268 0.198372 -0.599348
2018-09-22 -0.015386 0.179933 -0.605513
2018-09-23 -0.021231 0.154883 -0.613888
2018-09-24 -0.023536 0.127701 -0.622976
2018-09-25 0.030160 0.161712 -0.611605
2018-09-26 0.025528 0.191368 -0.601690
Given a time series of returns, we need to evaluate the aggregate return for every combination of starting point to ending point.
The first trick is to convert a time series of returns into a series of return indices. Given a series of return indices, I can calculate the return over any sub-period with the return index at the beginning ri_0 and at the end ri_1. The calculation is: ri_1 / ri_0 - 1.
The second trick is to produce a second series of inverses of return indices. If r is my series of return indices then 1 / r is my series of inverses.
The third trick is to take the matrix product of r * (1 / r).Transpose.
r is an n x 1 matrix. (1 / r).Transpose is a 1 x n matrix. The resulting product contains every combination of ri_j / ri_k. Just subtract 1 and I've actually got returns.
The fourth trick is to ensure that I'm constraining my denominator to represent periods prior to those being represented by the numerator.
Below is my vectorized function.
import numpy as np
import pandas as pd
def max_dd(returns):
    # make into a DataFrame so that it is a 2-dimensional
    # matrix such that I can perform an nx1 by 1xn matrix
    # multiplication and end up with an nxn matrix
    r = pd.DataFrame(returns).add(1).cumprod()
    # I copy r.T to ensure r's index is not the same
    # object as 1 / r.T's columns object
    x = r.dot(1 / r.T.copy()) - 1
    x.columns.name, x.index.name = 'start', 'end'
    # let's make sure we only calculate a return when start
    # is less than end.
    y = x.stack().reset_index()
    y = y[y.start < y.end]
    # my choice is to return the periods and the actual max
    # draw down
    z = y.set_index(['start', 'end']).iloc[:, 0]
    # idxmin returns the (start, end) label pair; argmin returns
    # an integer position in current pandas
    start, end = z.idxmin()
    return z.min(), start, end
How does this perform?
For the vectorized solution, I ran 10 iterations over time series of lengths [10, 50, 100, 150, 200]. The time it took is below:
10: 0.032 seconds
50: 0.044 seconds
100: 0.055 seconds
150: 0.082 seconds
200: 0.047 seconds
The same test for the looped solution is below:
10: 0.153 seconds
50: 3.169 seconds
100: 12.355 seconds
150: 27.756 seconds
200: 49.726 seconds
Edit
Alexander's answer provides superior results. Same test using modified code
10: 0.000 seconds
50: 0.000 seconds
100: 0.004 seconds
150: 0.007 seconds
200: 0.008 seconds
I modified his code into the following function:
def max_dd(returns):
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    # idxmin/idxmax return index labels; argmin/argmax return positions in current pandas
    end = dd.idxmin()
    start = r.loc[:end].idxmax()
    return mdd, start, end
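To see what the cumprod/cummax approach returns, here is a small worked example on a made-up five-day series. The returns, the dates and the function name max_dd_cummax are invented for illustration, and the function uses idxmin/idxmax so that start and end come back as index labels.

```python
import pandas as pd


def max_dd_cummax(returns):
    """Max drawdown via cumprod/cummax (illustrative restatement)."""
    r = returns.add(1).cumprod()        # return index
    dd = r.div(r.cummax()).sub(1)       # drawdown from the running peak
    end = dd.idxmin()                   # label of the deepest drawdown
    start = r.loc[:end].idxmax()        # label of the peak preceding it
    return dd.min(), start, end


# Hypothetical daily returns
returns = pd.Series([0.10, -0.20, 0.05, -0.10, 0.30],
                    index=pd.date_range("2020-01-01", periods=5, freq="D"))

mdd, start, end = max_dd_cummax(returns)
print(mdd, start, end)
```

The return index here is 1.1, 0.88, 0.924, 0.8316, 1.08108, so the peak is on 2020-01-01, the trough on 2020-01-04, and the maximum drawdown is 0.8316 / 1.1 - 1, about -0.244.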
I recently had a similar issue, but instead of a global MDD, I was required to find the MDD for the interval after each peak. Also, in my case, I was supposed to take the MDD of each strategy alone and thus wasn't required to apply the cumprod. My vectorized implementation is also based on Investopedia.
def calc_MDD(networth):
    df = pd.Series(networth, name="nw").to_frame()
    max_peaks_idx = df.nw.expanding(min_periods=1).apply(lambda x: x.argmax()).fillna(0).astype(int)
    df['max_peaks_idx'] = pd.Series(max_peaks_idx).to_frame()
    nw_peaks = pd.Series(df.nw.iloc[max_peaks_idx.values].values, index=df.nw.index)
    df['dd'] = ((df.nw-nw_peaks)/nw_peaks)
    df['mdd'] = df.groupby('max_peaks_idx').dd.apply(lambda x: x.expanding(min_periods=1).apply(lambda y: y.min())).fillna(0)
    return df
Here is a sample after running this code:
nw max_peaks_idx dd mdd
0 10000.000 0 0.000000 0.000000
1 9696.948 0 -0.030305 -0.030305
2 9538.576 0 -0.046142 -0.046142
3 9303.953 0 -0.069605 -0.069605
4 9247.259 0 -0.075274 -0.075274
5 9421.519 0 -0.057848 -0.075274
6 9315.938 0 -0.068406 -0.075274
7 9235.775 0 -0.076423 -0.076423
8 9091.121 0 -0.090888 -0.090888
9 9033.532 0 -0.096647 -0.096647
10 8947.504 0 -0.105250 -0.105250
11 8841.551 0 -0.115845 -0.115845
And here is an image of the calculation applied to the complete dataset.
Although vectorized, this code is probably slower than the other, because for each time-series, there should be many peaks, and each one of these requires calculation, and so O(n_peaks*n_intervals).
PS: I could have eliminated the zero values in the dd and mdd columns, but I find it useful that these values help indicate when a new peak was observed in the time-series.