Maximum drawdown is a common risk metric in quantitative finance: it measures the largest peak-to-trough loss experienced over a period.
Recently, I became impatient with how long my looped approach takes to calculate max drawdown:
def max_dd_loop(returns):
    """returns is assumed to be a pandas Series"""
    max_so_far = None
    start, end = None, None
    r = returns.add(1).cumprod()  # cumulative return index
    # brute force: check the return over every (start, end) pair
    for r_start in r.index:
        for r_end in r.index:
            if r_start < r_end:
                current = r.loc[r_end] / r.loc[r_start] - 1
                if (max_so_far is None) or (current < max_so_far):
                    max_so_far = current
                    start, end = r_start, r_end
    return max_so_far, start, end
I'm familiar with the common perception that a vectorized solution would be better.
The questions are:
Can I vectorize this problem?
What does that solution look like?
How beneficial is it?
Edit
I modified Alexander's answer into the following function:
def max_dd(returns):
    """Assumes returns is a pandas Series"""
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)   # drawdown at each point in time
    mdd = dd.min()                  # the maximum (most negative) drawdown
    end = dd.idxmin()               # date the drawdown bottoms out
    start = r.loc[:end].idxmax()    # date of the preceding peak
    return mdd, start, end
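For reference, a minimal usage sketch with a hypothetical random daily return series (any Series of periodic returns works the same way):
import numpy as np
import pandas as pd

np.random.seed(0)
returns = pd.Series(np.random.normal(0.001, 0.05, 252),
                    index=pd.date_range('2016-01-01', periods=252, freq='D'))
mdd, start, end = max_dd(returns)
print(f"max drawdown of {mdd:.2%} from {start.date()} to {end.date()}")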
df_returns is assumed to be a DataFrame of returns, where each column is a separate strategy/manager/security and each row is a new date (e.g. monthly or daily).
cum_returns = (1 + df_returns).cumprod()
drawdown = 1 - cum_returns.div(cum_returns.cummax())
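Since drawdown here is expressed as a positive fraction of the peak, the max drawdown of each column can then be read off with a single reduction (a short sketch under the same assumptions about df_returns):
# max drawdown per strategy/manager/security, as a positive fraction of the peak
max_drawdown_per_column = drawdown.max()

# if you also want the date on which each column's drawdown bottomed out
trough_dates = drawdown.idxmax()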
I had first suggested using an .expanding() window, but that's not necessary: the .cumprod() and .cummax() built-ins are enough to calculate the max drawdown up to any given point:
from datetime import date

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'returns': np.random.normal(0.001, 0.05, 1000)},
                  index=pd.date_range(start=date(2016, 1, 1), periods=1000, freq='D'))
df['cumulative_return'] = df.returns.add(1).cumprod().subtract(1)
df['max_drawdown'] = df.cumulative_return.add(1).div(df.cumulative_return.cummax().add(1)).subtract(1)
returns cumulative_return max_drawdown
2016-01-01 -0.014522 -0.014522 0.000000
2016-01-02 -0.022769 -0.036960 -0.022769
2016-01-03 0.026735 -0.011214 0.000000
2016-01-04 0.054129 0.042308 0.000000
2016-01-05 -0.017562 0.024004 -0.017562
2016-01-06 0.055254 0.080584 0.000000
2016-01-07 0.023135 0.105583 0.000000
2016-01-08 -0.072624 0.025291 -0.072624
2016-01-09 -0.055799 -0.031919 -0.124371
2016-01-10 0.129059 0.093020 -0.011363
2016-01-11 0.056123 0.154364 0.000000
2016-01-12 0.028213 0.186932 0.000000
2016-01-13 0.026914 0.218878 0.000000
2016-01-14 -0.009160 0.207713 -0.009160
2016-01-15 -0.017245 0.186886 -0.026247
2016-01-16 0.003357 0.190869 -0.022979
2016-01-17 -0.009284 0.179813 -0.032050
2016-01-18 -0.027361 0.147533 -0.058533
2016-01-19 -0.058118 0.080841 -0.113250
2016-01-20 -0.049893 0.026914 -0.157492
2016-01-21 -0.013382 0.013173 -0.168766
2016-01-22 -0.020350 -0.007445 -0.185681
2016-01-23 -0.085842 -0.092648 -0.255584
2016-01-24 0.022406 -0.072318 -0.238905
2016-01-25 0.044079 -0.031426 -0.205356
2016-01-26 0.045782 0.012917 -0.168976
2016-01-27 -0.018443 -0.005764 -0.184302
2016-01-28 0.021461 0.015573 -0.166797
2016-01-29 -0.062436 -0.047836 -0.218819
2016-01-30 -0.013274 -0.060475 -0.229189
... ... ... ...
2018-08-28 0.002124 0.559122 -0.478738
2018-08-29 -0.080303 0.433921 -0.520597
2018-08-30 -0.009798 0.419871 -0.525294
2018-08-31 -0.050365 0.348359 -0.549203
2018-09-01 0.080299 0.456631 -0.513004
2018-09-02 0.013601 0.476443 -0.506381
2018-09-03 -0.009678 0.462153 -0.511158
2018-09-04 -0.026805 0.422960 -0.524262
2018-09-05 0.040832 0.481062 -0.504836
2018-09-06 -0.035492 0.428496 -0.522411
2018-09-07 -0.011206 0.412489 -0.527762
2018-09-08 0.069765 0.511031 -0.494817
2018-09-09 0.049546 0.585896 -0.469787
2018-09-10 -0.060201 0.490423 -0.501707
2018-09-11 -0.018913 0.462235 -0.511131
2018-09-12 -0.094803 0.323611 -0.557477
2018-09-13 0.025736 0.357675 -0.546088
2018-09-14 -0.049468 0.290514 -0.568542
2018-09-15 0.018146 0.313932 -0.560713
2018-09-16 -0.034118 0.269104 -0.575700
2018-09-17 0.012191 0.284576 -0.570527
2018-09-18 -0.014888 0.265451 -0.576921
2018-09-19 0.041180 0.317562 -0.559499
2018-09-20 0.001988 0.320182 -0.558623
2018-09-21 -0.092268 0.198372 -0.599348
2018-09-22 -0.015386 0.179933 -0.605513
2018-09-23 -0.021231 0.154883 -0.613888
2018-09-24 -0.023536 0.127701 -0.622976
2018-09-25 0.030160 0.161712 -0.611605
2018-09-26 0.025528 0.191368 -0.601690
Given a time series of returns, we need to evaluate the aggregate return for every combination of starting point and ending point.
The first trick is to convert a time series of returns into a series of return indices. Given a series of return indices, I can calculate the return over any sub-period with the return index at the beginning ri_0 and at the end ri_1. The calculation is: ri_1 / ri_0 - 1.
The second trick is to produce a second series of inverses of return indices. If r is my series of return indices then 1 / r is my series of inverses.
The third trick is to take the matrix product of r * (1 / r).Transpose.
r is an n x 1 matrix. (1 / r).Transpose is a 1 x n matrix. The resulting product contains every combination of ri_j / ri_k. Just subtract 1 and I've actually got returns.
The fourth trick is to ensure that I'm constraining my denominator to represent periods prior to those being represented by the numerator.
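To make these tricks concrete, here is a minimal NumPy sketch with made-up return indices (the pandas version follows below):
import numpy as np

ri = np.array([1.00, 1.05, 0.95, 1.02])        # hypothetical return indices (trick one)
pairwise = ri[None, :] / ri[:, None] - 1       # [start, end] -> ri[end] / ri[start] - 1 (tricks two and three)
mask = np.triu(np.ones(pairwise.shape, dtype=bool), k=1)  # keep only start < end (trick four)
print(pairwise[mask].min())                    # worst return over all valid (start, end) pairs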
Below is my vectorized function.
import numpy as np
import pandas as pd


def max_dd(returns):
    # make into a DataFrame so that it is a 2-dimensional
    # matrix such that I can perform an nx1 by 1xn matrix
    # multiplication and end up with an nxn matrix
    r = pd.DataFrame(returns).add(1).cumprod()
    # I copy r.T to ensure r's index is not the same
    # object as 1 / r.T's columns object
    x = r.dot(1 / r.T.copy()) - 1
    x.columns.name, x.index.name = 'start', 'end'
    # let's make sure we only calculate a return when start
    # is less than end
    y = x.stack().reset_index()
    y = y[y.start < y.end]
    # my choice is to return the actual max drawdown
    # together with the start and end periods
    z = y.set_index(['start', 'end']).iloc[:, 0]
    start, end = z.idxmin()
    return z.min(), start, end
How does this perform?
For the vectorized solution, I ran 10 iterations over time series of lengths [10, 50, 100, 150, 200]. The timings are below:
10: 0.032 seconds
50: 0.044 seconds
100: 0.055 seconds
150: 0.082 seconds
200: 0.047 seconds
The same test for the looped solution is below:
10: 0.153 seconds
50: 3.169 seconds
100: 12.355 seconds
150: 27.756 seconds
200: 49.726 seconds
Edit
Alexander's answer provides superior results. Here is the same test using the modified code:
10: 0.000 seconds
50: 0.000 seconds
100: 0.004 seconds
150: 0.007 seconds
200: 0.008 seconds
I modified his code into the following function:
def max_dd(returns):
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    end = dd.idxmin()
    start = r.loc[:end].idxmax()
    return mdd, start, end
I recently had a similar issue, but instead of a global MDD, I was required to find the MDD for the interval after each peak. Also, in my case, I was supposed to take the MDD of each strategy alone and thus wasn't required to apply the cumprod. My vectorized implementation is also based on Investopedia.
def calc_MDD(networth):
    df = pd.Series(networth, name="nw").to_frame()

    # positional index of the running maximum (peak) seen so far at each point
    max_peaks_idx = df.nw.expanding(min_periods=1).apply(lambda x: x.argmax()).fillna(0).astype(int)
    df['max_peaks_idx'] = pd.Series(max_peaks_idx).to_frame()

    # net worth at the most recent peak, aligned to every row
    nw_peaks = pd.Series(df.nw.iloc[max_peaks_idx.values].values, index=df.nw.index)

    # drawdown relative to the most recent peak, and the running MDD since that peak
    df['dd'] = ((df.nw - nw_peaks) / nw_peaks)
    df['mdd'] = df.groupby('max_peaks_idx').dd.apply(
        lambda x: x.expanding(min_periods=1).apply(lambda y: y.min())).fillna(0)
    return df
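A minimal usage sketch, passing a plain list of net worth values (these particular numbers are illustrative):
import pandas as pd

# hypothetical net worth series; the first values mirror the sample output below
networth = [10000.000, 9696.948, 9538.576, 9303.953, 9247.259, 9421.519]
print(calc_MDD(networth))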
Here is a sample after running this code:
nw max_peaks_idx dd mdd
0 10000.000 0 0.000000 0.000000
1 9696.948 0 -0.030305 -0.030305
2 9538.576 0 -0.046142 -0.046142
3 9303.953 0 -0.069605 -0.069605
4 9247.259 0 -0.075274 -0.075274
5 9421.519 0 -0.057848 -0.075274
6 9315.938 0 -0.068406 -0.075274
7 9235.775 0 -0.076423 -0.076423
8 9091.121 0 -0.090888 -0.090888
9 9033.532 0 -0.096647 -0.096647
10 8947.504 0 -0.105250 -0.105250
11 8841.551 0 -0.115845 -0.115845
(A plot of the same calculation applied to the complete dataset followed here in the original post.)
Although vectorized, this code is probably slower than the others, because each time series typically has many peaks, each of which requires its own calculation, so the complexity is roughly O(n_peaks * n_intervals).
PS: I could have eliminated the zero values in the dd and mdd columns, but I find it useful that these values help indicate when a new peak was observed in the time-series.
Related
I was looking through the pandas.query documentation but couldn't find anything specific about this.
Is it possible to perform a query on a date based on the closest date to the one given, instead of a specific date?
For example, let's say we use the wine dataset and create some random dates.
import pandas as pd
import numpy as np
from sklearn import datasets

dir(datasets)
df = pd.DataFrame(datasets.load_wine().data)
df.columns = datasets.load_wine().feature_names
df.columns = df.columns.str.strip()

def random_dates(start, end, n, unit='D'):
    ndays = (end - start).days + 1
    return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit) + start

np.random.seed(0)
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2022-01-01')
datelist = random_dates(start, end, 178)
df['Dates'] = datelist
If you perform a simple query on hue,
df.query('hue == 0.6')
you'll receive three rows with three random dates. Is it possible to pick the query result that's closest to, say, 2017-01-01?
so something like
df.query('hue==0.6').query('Date ~2017-1-1')
I hope this makes sense!
You can use something like:
df.query("('2018-01-01' < Dates) & (Dates < '2018-01-31')")
# Output
alcohol malic_acid ... proline Dates
6 14.39 1.87 ... 1290.0 2018-01-24 08:21:14.665824000
41 13.41 3.84 ... 1035.0 2018-01-22 22:15:56.547561600
51 13.83 1.65 ... 1265.0 2018-01-26 22:37:26.812156800
131 12.88 2.99 ... 530.0 2018-01-01 18:58:05.118441600
139 12.84 2.96 ... 590.0 2018-01-08 13:38:26.117376000
142 13.52 3.17 ... 520.0 2018-01-19 22:37:10.170825600
[6 rows x 14 columns]
Or using @-prefixed variables:
date = pd.to_datetime('2018-01-01')
offset = pd.DateOffset(days=10)
start = date - offset
end = date + offset
df.query("Dates.between(#start, #end)")
# Output
alcohol malic_acid ... proline Dates
131 12.88 2.99 ... 530.0 2018-01-01 18:58:05.118441600
139 12.84 2.96 ... 590.0 2018-01-08 13:38:26.117376000
Given a series, find the entries closest to a given date:
def closest_to_date(series, date, n=5):
    date = pd.to_datetime(date)
    return abs(series - date).nsmallest(n)
Then we can use the index of the returned series to select further rows (or you can change the API to suit you):
(df.loc[df.hue == 0.6]
   .loc[lambda df_: closest_to_date(df_.Dates, "2017-1-1", n=1).index]
)
I'm not sure if you have to use query, but this will give you the results you are looking for:
df['Count'] = (df[df['hue'] == .6].sort_values(['Dates'], ascending=True)).groupby(['hue']).cumcount() + 1
df.loc[df['Count'] == 1]
Here's a sample dataset with observations from 4 different trips (there are 4 unique trip IDs):
trip_id time_interval speed
0 8a8449635c10cc4b8e7841e517f27e2652c57ea3 873.96 0.062410
1 8a8449635c10cc4b8e7841e517f27e2652c57ea3 11.46 0.000000
2 8a8449635c10cc4b8e7841e517f27e2652c57ea3 903.96 0.247515
3 8a8449635c10cc4b8e7841e517f27e2652c57ea3 882.48 0.121376
4 8a8449635c10cc4b8e7841e517f27e2652c57ea3 918.78 0.185405
5 8a8449635c10cc4b8e7841e517f27e2652c57ea3 885.96 0.122147
6 f7fd70a8c14e43d8be91ef180e297d7195bbe9b0 276.60 0.583178
7 84d14618dcb30c28520cb679e867593c1d29213e 903.48 0.193313
8 84d14618dcb30c28520cb679e867593c1d29213e 899.34 0.085377
9 84d14618dcb30c28520cb679e867593c1d29213e 893.46 0.092259
10 84d14618dcb30c28520cb679e867593c1d29213e 849.36 0.350341
11 3db35f9835db3fe550de194b55b3a90a6c1ecb97 10.86 0.000000
12 3db35f9835db3fe550de194b55b3a90a6c1ecb97 919.50 0.071119
I am trying to compute the acceleration of each unique trip from one point to another.
Example:
the first acceleration value will be computed using rows 0 and 1 (row 0 is the initial value, row 1 the final value);
the second acceleration value will be computed using rows 1 and 2 (row 1 is the initial value, row 2 the final value);
... and so on.
As I want to compute this for each individual trip based on trip_id, this is what I attempted:
def get_acceleration(dataset):
    ##### INITIALISATION VARIABLES #####
    # Empty string for the trip ID
    current_trip_id = ""
    # Copy of the dataframe
    dataset2 = dataset.copy()
    # Creating a new column for the acceleration between observations of the same trip
    # all rows have a default value of 0
    dataset2["acceleration"] = 0

    ##### LOOP #####
    for index, row in dataset.iterrows():
        # Checking if we are looking at the same trip
        # when looking at the same trip, the default values of zero are replaced
        # by the calculated trip characteristic
        if row["trip_id"] == current_trip_id:
            # Final speed and time
            final_speed = row["speed"]
            final_time = row["time_interval"]
            print(type(final_speed))
            # Computing the acceleration (delta_v / delta_t)
            acceleration = (final_speed[1] - initial_speed[0]) / (final_time[1] - initial_time[0])
            # Adding the output to the acceleration column
            dataset2.loc[index, "acceleration"] = acceleration

        ##### UPDATING THE LOOP #####
        current_trip_id = row["trip_id"]
        # Initial speed and time
        initial_speed = row["speed"]
        initial_time = row["time_interval"]

    return dataset2
However, I get the error:
<ipython-input-42-0255a952850b> in get_acceleration(dataset)
27
28 # Computing the acceleration (delta_v/ delta_t)
---> 29 acceleration = (final_speed[1] - initial_speed[0])/(final_time[1] - initial_time[0])
30
31 # Adding the output to the acceleration column
TypeError: 'float' object is not subscriptable
How could I fix this error and compute the acceleration?
UPDATE:
After using the answer below, to avoid division by zero just add an if/else statement:
delta_speed = final_speed - initial_speed
delta_time = final_time - initial_time

# Computing the acceleration (delta_v / delta_t)
if delta_time != 0:
    acceleration = delta_speed / delta_time
else:
    acceleration = 0
It works if you drop the subscripts, since final_speed, initial_speed, final_time and initial_time are already floats:
acceleration = (final_speed - initial_speed)/(final_time - initial_time)
                                     trip_id  time_interval     speed  acceleration
0   8a8449635c10cc4b8e7841e517f27e2652c57ea3         873.96  0.062410      0.000000
1   8a8449635c10cc4b8e7841e517f27e2652c57ea3          11.46  0.000000      0.000072
2   8a8449635c10cc4b8e7841e517f27e2652c57ea3         903.96  0.247515      0.000277
3   8a8449635c10cc4b8e7841e517f27e2652c57ea3         882.48  0.121376      0.005872
4   8a8449635c10cc4b8e7841e517f27e2652c57ea3         918.78  0.185405      0.001764
5   8a8449635c10cc4b8e7841e517f27e2652c57ea3         885.96  0.122147      0.001927
6   f7fd70a8c14e43d8be91ef180e297d7195bbe9b0         276.60  0.583178      0.000000
7   84d14618dcb30c28520cb679e867593c1d29213e         903.48  0.193313      0.000000
8   84d14618dcb30c28520cb679e867593c1d29213e         899.34  0.085377      0.026071
9   84d14618dcb30c28520cb679e867593c1d29213e         893.46  0.092259      0.001170
10  84d14618dcb30c28520cb679e867593c1d29213e         849.36  0.350341     -0.005852
11  3db35f9835db3fe550de194b55b3a90a6c1ecb97          10.86  0.000000      0.000000
12  3db35f9835db3fe550de194b55b3a90a6c1ecb97         919.50  0.071119      0.000078
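For larger datasets, the same per-trip calculation can be done without an explicit loop. This is a hedged sketch of one way to do it with groupby().diff(), assuming the same column names as above:
import pandas as pd

def add_acceleration(dataset: pd.DataFrame) -> pd.DataFrame:
    """Vectorized sketch: per-trip acceleration as delta_v / delta_t."""
    out = dataset.copy()
    grouped = out.groupby("trip_id", sort=False)
    delta_v = grouped["speed"].diff()           # NaN on the first row of each trip
    delta_t = grouped["time_interval"].diff()
    acceleration = delta_v / delta_t
    # first row of each trip and zero time deltas fall back to 0, as in the loop version
    out["acceleration"] = acceleration.replace([float("inf"), -float("inf")], 0).fillna(0)
    return out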
I'm trying to use Python to get the time taken, as well as the average speed, for an object traveling between points.
The data looks somewhat like this,
location initialtime id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9km
2 2020-09-18T12:10:14.485952Z car_uno 83 8km
3 2020-09-18T11:59:14.484781Z car_duo 70 9km
7 2020-09-18T12:00:14.484653Z car_trio 85 8km
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5km
The function I'm using currently is essentially like this,
Speeds.index = pd.to_datetime(Speeds.index)
..etc
Now if I were doing this usually, I would just take the unique values of the id's,
for x in speeds.id.unique():
    Speeds[speeds.id == x]...
But this method really isn't working.
What is the best approach to simply check whether there are multiple points for an id over time and, if so, take the average of the speeds over that time, otherwise just return the speed itself?
Is there a simpler pandas filter I could use?
Expected output is simply,
area - id - initial time - journey time - average speed.
The point is to get the journey time and average speed for a vehicle going past two points.
To get the average speed and journey times you can use groupby() and pass in the columns that determine one complete journey, like id or area.
import pandas as pd
from io import StringIO
data = StringIO("""
area initialtime id speed
1 2020-09-18T12:03:14.485952Z car_uno 72
2 2020-09-18T12:10:14.485952Z car_uno 83
3 2020-09-18T11:59:14.484781Z car_duo 70
7 2020-09-18T12:00:14.484653Z car_trio 85
8 2020-09-18T12:12:14.484653Z car_trio 70
""")
df = pd.read_csv(data, delim_whitespace=True)
df["initialtime"] = pd.to_datetime(df["initialtime"])
# change to ["id", "area"] if need more granular aggregation
group_cols = ["id"]
time = df.groupby(group_cols)["initialtime"].agg([max, min]).eval('max-min').reset_index(name="journey_time")
speed = df.groupby(group_cols)["speed"].mean().reset_index(name="average_speed")
pd.merge(time, speed, on=group_cols)
id journey_time average_speed
0 car_duo 00:00:00 70.0
1 car_trio 00:12:00 77.5
2 car_uno 00:07:00 77.5
I tried a very intuitive solution. I'm assuming the data has already been loaded into df.
df['initialtime'] = pd.to_datetime(df['initialtime'])

result = []
for car in df['id'].unique():
    _df = df[df['id'] == car].sort_values('initialtime', ascending=True)

    # Where the car is leaving "from" and where it's heading "to"
    _df['From'] = _df['location']
    _df['To'] = _df['location'].shift(-1, fill_value=_df['location'].iloc[0])

    # Auxiliary columns
    _df['end_time'] = _df['initialtime'].shift(-1, fill_value=_df['initialtime'].iloc[0])
    _df['end_speed'] = _df['speed'].shift(-1, fill_value=_df['speed'].iloc[0])

    # Desired columns
    _df['journey_time'] = _df['end_time'] - _df['initialtime']
    _df['avg_speed'] = (_df['speed'] + _df['end_speed']) / 2
    _df = _df[_df['journey_time'] >= pd.Timedelta(0)]

    _df.drop(['location', 'distance', 'speed', 'end_time', 'end_speed'],
             axis=1, inplace=True)
    result.append(_df)

final_df = pd.concat(result).reset_index(drop=True)
The final DataFrame is as follows:
initialtime id From To journey_time avg_speed
0 2020-09-18 12:03:14.485952+00:00 car_uno 1 2 0 days 00:07:00 77.5
1 2020-09-18 11:59:14.484781+00:00 car_duo 3 3 0 days 00:00:00 70.0
2 2020-09-18 12:00:14.484653+00:00 car_trio 7 8 0 days 00:12:00 77.5
Here is another approach. My results are different from the other posts, so I may have misunderstood the requirements. In brief, I calculated each average speed as total distance divided by total time (for each car).
from io import StringIO
import pandas as pd
# speed in km / hour; distance in km
data = '''location initial-time id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9
2 2020-09-18T12:10:14.485952Z car_uno 83 8
3 2020-09-18T11:59:14.484781Z car_duo 70 9
7 2020-09-18T12:00:14.484653Z car_trio 85 8
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5
'''
Now create the data frame and perform the calculations:
# create data frame
df = pd.read_csv(StringIO(data), delim_whitespace=True)
df['elapsed-time'] = df['distance'] / df['speed']  # in hours

# utility function
def hours_to_hms(elapsed):
    '''Convert `elapsed` (in hours) to hh:mm:ss (rounded to nearest sec).'''
    h, m = divmod(elapsed, 1)
    m *= 60
    _, s = divmod(m, 1)
    s *= 60
    hms = '{:02d}:{:02d}:{:02d}'.format(int(h), int(m), int(round(s, 0)))
    return hms

# perform calculations
start_time = df.groupby('id')['initial-time'].min()
journey_hrs = df.groupby('id')['elapsed-time'].sum().rename('elapsed-hrs')
hms = journey_hrs.apply(lambda x: hours_to_hms(x)).rename('hh:mm:ss')
ave_speed = ((df.groupby('id')['distance'].sum()
              / df.groupby('id')['elapsed-time'].sum())
             .rename('ave speed (km/hr)')
             .round(2))

# assemble results
result = pd.concat([start_time, journey_hrs, hms, ave_speed], axis=1)
print(result)
initial-time elapsed-hrs hh:mm:ss \
id
car_duo 2020-09-18T11:59:14.484781Z 0.128571 00:07:43
car_trio 2020-09-18T12:00:14.484653Z 0.201261 00:12:05
car_uno 2020-09-18T12:03:14.485952Z 0.221386 00:13:17
ave speed (km/hr)
id
car_duo 70.00
car_trio 77.01
car_uno 76.79
You should provide a better dataset (i.e. with identical time points) so that we better understand the inputs, and an example of the expected output so that we understand the computation of the average speed.
Thus I'm just guessing that you may be looking for df.groupby('initialtime')['speed'].mean() if df is a dataframe containing your input data.
I have a set up like this - but with different numbers:
df1 = pd.DataFrame({'Date': ['2020-01-06', '2020-01-07', '2020-01-08', '2020-01-09', '2020-01-10',
                             '2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16', '2020-01-17',
                             '2020-01-20'],
                    'Return': [0.02, 0.004, 0.006, 0.001, 0.005, 0.01, 0.015, 0.001, 0.0014,
                               0.04, 0.037]})
weights = [1,2,3]
What I need to do is multiply the last 3 Returns by the weights, sum them, then take the square root of the answer, and store that against 2020-01-20. I then need to multiply the last 3 returns EXCEPT the last line (so shifted 1 row) by the weights, sum, take the square root, and store the answer against 2020-01-17, and so on.
So my output column would be blank for the first 2 rows, then have 9 populated entries from 2020-01-08 onwards.
So I need to do the calc, shift the column 1 row, and then repeat, but saving the summed sqrt'd return each time.
I can calculate the one-off (just for the last 3 rows) correctly using:
df_last_3 = df.iloc[-3:].reset_index()
df_last_3['return*weights'] = df_last_3 * weights
sqrt_return= (np.sqrt((df_last_3['return*weights']).sum()))
But I then need to perform the same calculation on the shifted column rows - and store the result.
I'm new to Python and not practised enough with loops to really figure it out. I've had a go but didn't get the results I was after.
I've looked all round for examples of this and still can't get the solution. Any help would be appreciated.
Use Series.rolling with a window size of 3, then use .apply to process each rolling window according to the requirements:
df1['Result'] = df1['Return'].rolling(3).apply(
    lambda s: np.sqrt((s * weights[::-1]).sum()))
# print(df1)
Date Return Result
0 2020-01-06 0.02 NaN
1 2020-01-07 0.004 NaN
2 2020-01-08 0.006 0.272029
3 2020-01-09 0.001 0.158114
4 2020-01-10 0.005 0.158114
5 2020-01-13 0.01 0.151658
6 2020-01-14 0.015 0.223607
7 2020-01-15 0.001 0.246982
8 2020-01-16 0.0014 0.220000
9 2020-01-17 0.04 0.214009
10 2020-01-20 0.037 0.348138
# Calculations:
# 0.348138 = sqrt(1 * 0.037 + 2 * 0.04 + 3 * 0.0014)
# 0.214009 = sqrt(1 * 0.04 + 2 * 0.0014 + 3 * 0.001)
# ...
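If the Series is long and .apply becomes a bottleneck, the same weighted window can also be computed with a plain dot product. A hedged sketch (assumes NumPy 1.20+ for sliding_window_view, and the same df1/weights as above):
import numpy as np

w = np.asarray(weights[::-1], dtype=float)            # [3, 2, 1]: oldest value in the window gets weight 3
vals = df1['Return'].to_numpy(dtype=float)

out = np.full(len(vals), np.nan)
windows = np.lib.stride_tricks.sliding_window_view(vals, 3)  # shape (n - 2, 3)
out[2:] = np.sqrt(windows @ w)                        # weighted sum per window, then square root

df1['Result_vectorized'] = out                        # matches the rolling .apply column above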
I have two pandas dataframes, each with two columns: a measurement and a timestamp. I need to multiply the first differences of the measurements, but only if there is a time overlap between the two measurement intervals. How can I do this efficiently, as the size of the dataframes gets large?
Example:
dfA
mesA timeA
0 125 2015-01-14 04:44:49
1 100 2015-01-14 05:16:23
2 115 2015-01-14 08:57:10
dfB
mesB timeB
0 140 2015-01-14 00:13:17
1 145 2015-01-14 08:52:01
2 120 2015-01-14 11:31:44
Here I would multiply (100-125)*(145-140) since there is a time overlap between the intervals [04:44:49, 05:16:23] and [00:13:17, 08:52:01], but not (100-125)*(120-145), since there isn't one. Similarly, I would have (115-100)*(145-140) but also (115-100)*(120-145), since both have a time overlap.
In the end I will have to sum all the relevant products in a single value, so the result need not be a dataframe. In this case:
s = (100-125)*(145-140)+(115-100)*(145-140)+(115-100)*(120-145) = -425
My current solution:
s = 0
for i in range(1, len(dfA)):
    startA = dfA['timeA'][i-1]
    endA = dfA['timeA'][i]
    for j in range(1, len(dfB)):
        startB = dfB['timeB'][j-1]
        endB = dfB['timeB'][j]
        if (endB > startA) & (startB < endA):
            s += (dfA['mesA'][i] - dfA['mesA'][i-1]) * (dfB['mesB'][j] - dfB['mesB'][j-1])
Although it seems to work, it is very inefficient and becomes impractical with very large datasets. I believe it could be vectorized more efficiently, perhaps using numexpr, but I still haven't found a way.
EDIT:
other data
mesA timeA
0 125 2015-01-14 05:54:03
1 100 2015-01-14 11:39:53
2 115 2015-01-14 23:58:13
mesB timeB
0 110 2015-01-14 10:58:32
1 120 2015-01-14 13:30:00
2 135 2015-01-14 22:29:26
s = 125
Edit: the original answer did not work, so I came up with another version that is not vectorized, but it requires both dataframes to be sorted by date.
arrA = dfA.timeA.to_numpy()
startA, endA = arrA[0], arrA[1]
arr_mesA = dfA.mesA.diff().to_numpy()
mesA = arr_mesA[1]

arrB = dfB.timeB.to_numpy()
startB, endB = arrB[0], arrB[1]
arr_mesB = dfB.mesB.diff().to_numpy()
mesB = arr_mesB[1]

s = 0
i, j = 1, 1
imax = len(dfA) - 1
jmax = len(dfB) - 1
while True:
    if (endB > startA) & (startB < endA):
        s += mesA * mesB
    if (endB > endA) and (i < imax):
        i += 1
        startA, endA, mesA = endA, arrA[i], arr_mesA[i]
    elif j < jmax:
        j += 1
        startB, endB, mesB = endB, arrB[j], arr_mesB[j]
    else:
        break
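For comparison, the pairwise overlap test itself can be vectorized with NumPy broadcasting, at the cost of building an (n-1) x (m-1) boolean matrix. This is a minimal sketch of that idea (not part of the original answers); it reproduces s = -425 on the first example above:
import numpy as np

# first differences of the measurements (one per interval)
dA = dfA['mesA'].diff().to_numpy()[1:]
dB = dfB['mesB'].diff().to_numpy()[1:]

# interval endpoints
startA, endA = dfA['timeA'].to_numpy()[:-1], dfA['timeA'].to_numpy()[1:]
startB, endB = dfB['timeB'].to_numpy()[:-1], dfB['timeB'].to_numpy()[1:]

# boolean matrix: does interval i of A overlap interval j of B?
overlap = (endB[None, :] > startA[:, None]) & (startB[None, :] < endA[:, None])

s = (dA[:, None] * dB[None, :] * overlap).sum()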
Original (non-working) answer
The idea is to create categories with pd.cut, based on the values in dfB['timeB'], in both dataframes to see where they could overlap, then calculate the diff in measurements, merge both dataframes on the categories, and finally multiply and sum the whole thing.
# create bins
bins_dates = [min(dfB['timeB'].min(), dfA['timeA'].min()) - pd.DateOffset(hours=1)] \
             + dfB['timeB'].tolist() \
             + [max(dfB['timeB'].max(), dfA['timeA'].max()) + pd.DateOffset(hours=1)]

# work on dfB
dfB['cat'] = pd.cut(dfB['timeB'], bins=bins_dates,
                    labels=range(len(bins_dates)-1), right=False)
dfB['deltaB'] = -dfB['mesB'].diff(-1).ffill()

# work on dfA
dfA['cat'] = pd.cut(dfA['timeA'], bins=bins_dates,
                    labels=range(len(bins_dates)-1), right=False)
# need to calculate the delta for both the start and the end of the intervals
dfA['deltaAStart'] = -dfA['mesA'].diff(-1)
dfA['deltaAEnd'] = dfA['mesA'].diff().mask(dfA['cat'].astype(float).diff().eq(0))
# in the above method, for the end of interval, use a mask to not count twice
# intervals that are fully included in one interval of B

# then merge and calculate the multiplication you are after
df_ = dfB[['cat', 'deltaB']].merge(dfA[['cat', 'deltaAStart', 'deltaAEnd']])
s = (df_['deltaB'].to_numpy()[:, None] * df_[['deltaAStart', 'deltaAEnd']]).sum().sum()
print(s)
# -425.0