Python: calculate rolling returns over different frequencies

I have the following DataFrame:
data = [[99330,12,122],[1123,1230,1287],[123,101,812739],[1143,1230123,252],[234,342,4546],[2445,3453,3457],[7897,8657,5675],[46,5675,453],[76,484,3735],[363,93,4568],[385,568,367],[458,846,4847],[574,45747,658468],[57457,46534,4675]]
df1 = pd.DataFrame(data, index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
'2022-01-13', '2022-01-14'],
columns=['col_A', 'col_B', 'col_C'])
df1.index = pd.to_datetime(df1.index)
df1:
col_A col_B col_C
2022-01-01 99330 12 122
2022-01-02 1123 1230 1287
2022-01-03 123 101 812739
2022-01-04 1143 1230123 252
2022-01-05 234 342 4546
2022-01-06 2445 3453 3457
2022-01-07 7897 8657 5675
2022-01-08 46 5675 453
2022-01-09 76 484 3735
2022-01-10 363 93 4568
2022-01-11 385 568 367
2022-01-12 458 846 4847
2022-01-13 574 45747 658468
2022-01-14 57457 46534 4675
I am applying the following to get rolling returns:
periodicity_dict = {1:'daily', 7:'weekly'}
df_columns = df1.columns
for key in periodicity_dict:
    for col in df_columns:
        df1[col+'_rolling']= np.nan
        for i in range(key, len(df1[col][df1[col].first_valid_index():df1[col].last_valid_index()])):
            df1[col+'_rolling'].iloc[i] = (df1[col].iloc[i] - df[col].iloc[i-key])/df[col].iloc[i-key]
But I am getting the following error: KeyError: 'col_A'.
What am I doing wrong? And is there a better way to do this with fewer loops?

I think you are looking for something like the shift method (no for-loop is needed):
df1['col_A_rolling'] = (df1['col_A'] - df1['col_A'].shift(7)) / df1['col_A'].shift(7)
OUTPUT:
col_A col_B col_C col_A_rolling
2022-01-01 99330 12 122 NaN
2022-01-02 1123 1230 1287 NaN
2022-01-03 123 101 812739 NaN
2022-01-04 1143 1230123 252 NaN
2022-01-05 234 342 4546 NaN
2022-01-06 2445 3453 3457 NaN
2022-01-07 7897 8657 5675 NaN
2022-01-08 46 5675 453 -0.999537
2022-01-09 76 484 3735 -0.932324
2022-01-10 363 93 4568 1.951220
2022-01-11 385 568 367 -0.663167
2022-01-12 458 846 4847 0.957265
2022-01-13 574 45747 658468 -0.765235
2022-01-14 57457 46534 4675 6.275801
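The same idea extends to every column and every period at once. The sketch below is only one way to arrange it: it reuses periodicity_dict from the question, while the names base, pieces and result are made up for illustration. As for the KeyError, the loop in the question reads from df rather than df1, so the likely cause (an inference from the snippet, not something confirmed by the asker) is that df is a different DataFrame without a col_A column.

```python
import pandas as pd

data = [[99330,12,122],[1123,1230,1287],[123,101,812739],[1143,1230123,252],
        [234,342,4546],[2445,3453,3457],[7897,8657,5675],[46,5675,453],
        [76,484,3735],[363,93,4568],[385,568,367],[458,846,4847],
        [574,45747,658468],[57457,46534,4675]]
# Same 14 consecutive dates as in the question
df1 = pd.DataFrame(data,
                   index=pd.date_range('2022-01-01', periods=14, freq='D'),
                   columns=['col_A', 'col_B', 'col_C'])

periodicity_dict = {1: 'daily', 7: 'weekly'}

base = df1.copy()      # keep the raw columns separate from the derived ones
pieces = [base]
for periods, name in periodicity_dict.items():
    previous = base.shift(periods)              # value `periods` rows earlier
    returns = (base - previous) / previous      # applied to all columns at once
    pieces.append(returns.add_suffix(f'_rolling_{name}'))

result = pd.concat(pieces, axis=1)
print(result.filter(like='col_A'))
```

Giving each period its own suffix also avoids a second issue in the original loop, where the weekly pass would overwrite the daily col_A_rolling column.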

Related

pandas.to_datetime not converting all rows to datetime

A simple transformation to convert a string date column to datetime in a DataFrame is not working. Please see the last column, from row 990 onwards.
new_df = pd.melt(
    frame=df,
    id_vars={'Date', 'Day'}
)
new_df['new_date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y', errors='raise')
Date Day variable value new_date
0 1/5/2015 289 Cases_Guinea 2776.0 2015-01-05
1 1/4/2015 288 Cases_Guinea 2775.0 2015-01-04
2 1/3/2015 287 Cases_Guinea 2769.0 2015-01-03
3 1/2/2015 286 Cases_Guinea NaN 2015-01-02
4 12/31/2014 284 Cases_Guinea 2730.0 2014-12-31
5 12/28/2014 281 Cases_Guinea 2706.0 2014-12-28
6 12/27/2014 280 Cases_Guinea 2695.0 2014-12-27
7 12/24/2014 277 Cases_Guinea 2630.0 2014-12-24
8 12/21/2014 273 Cases_Guinea 2597.0 2014-12-21
9 12/20/2014 272 Cases_Guinea 2571.0 2014-12-20
.. ... ... ... ... ...
990 12/3/2014 256 Deaths_Guinea NaN NaT
991 11/30/2014 253 Deaths_Guinea 1327.0 NaT
992 11/28/2014 251 Deaths_Guinea NaN NaT
993 11/23/2014 246 Deaths_Guinea 1260.0 NaT
994 11/22/2014 245 Deaths_Guinea NaN NaT
995 11/18/2014 241 Deaths_Guinea 1214.0 NaT
996 11/16/2014 239 Deaths_Guinea 1192.0 NaT
997 11/15/2014 238 Deaths_Guinea NaN NaT
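No answer is quoted for this question, so the following is only a plausible explanation. The conversion is applied to df['Date'] (the original frame) and then assigned to new_df, which has more rows after melt. pandas aligns the assignment on the index, so every row of new_df whose index label does not exist in df ends up as NaT. The small frame below is invented to reproduce the effect; the Deaths_Guinea values and the bad_date column are placeholders, not data from the question.

```python
import pandas as pd

# Hypothetical two-row stand-in for the wide frame in the question
df = pd.DataFrame({
    'Date': ['1/5/2015', '1/4/2015'],
    'Day': [289, 288],
    'Cases_Guinea': [2776.0, 2775.0],
    'Deaths_Guinea': [1786.0, 1781.0],   # placeholder numbers
})

# melt stacks the value columns, so new_df has 4 rows with index 0..3
new_df = pd.melt(frame=df, id_vars=['Date', 'Day'])

# Converting the ORIGINAL frame's column gives a Series with index 0..1;
# on assignment, rows 2 and 3 of new_df find no matching label and become NaT
new_df['bad_date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# Converting the melted frame's own column covers every row
new_df['new_date'] = pd.to_datetime(new_df['Date'], format='%m/%d/%Y')

print(new_df)
```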

Python: Dynamically calculate rolling returns over different frequencies

Consider a DataFrame with multiple columns as follows:
data = [[99330,12,122],[1123,1230,1287],[123,101,812739],[1143,1230123,252],[234,342,4546],[2445,3453,3457],[7897,8657,5675],[46,5675,453],[76,484,3735],[363,93,4568],[385,568,367],[458,846,4847],[574,45747,658468],[57457,46534,4675]]
df1 = pd.DataFrame(data, index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
'2022-01-13', '2022-01-14'],
columns=['col_A', 'col_B', 'col_C'])
df1.index = pd.to_datetime(df1.index)
df1:
col_A col_B col_C
2022-01-01 99330 12 122
2022-01-02 1123 1230 1287
2022-01-03 123 101 812739
2022-01-04 1143 1230123 252
2022-01-05 234 342 4546
2022-01-06 2445 3453 3457
2022-01-07 7897 8657 5675
2022-01-08 46 5675 453
2022-01-09 76 484 3735
2022-01-10 363 93 4568
2022-01-11 385 568 367
2022-01-12 458 846 4847
2022-01-13 574 45747 658468
2022-01-14 57457 46534 4675
Is there a way to write a loop so I can calculate the rolling returns on a daily ('1D'), weekly ('1W'), monthly ('1M') and six monthly ('6M') basis?
EDIT: Here is my attempt at calculating the rolling return on a daily and weekly basis:
periodicity_dict = {'1D':'daily', '1W':'weekly'}
df_columns = df1.columns
for key in periodicity_dict:
    for col in df_columns:
        df1[col+'_rolling']= np.nan
        for i in pd.date_range(start=df1[col].first_valid_index(), end=df1[col].last_valid_index(), freq=key):
            df1[col+'_rolling'].iloc[i] = (df1[col].iloc[i] - df[col].iloc[i-'1W'])/df[col].iloc[i-'1W']
pct_change does the shifting math for you, but you would have to do it one window at a time.
windows = ["1D", "7D"]
for window in windows:
    df1 = pd.merge(
        df1,
        (
            df1[["col_A", "col_B", "col_C"]]
            .pct_change(freq=window)
            .add_suffix(f"_rolling_{window}")
        ),
        left_index=True,
        right_index=True,
    )
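One point worth checking before relying on freq: pct_change(freq=...) compares against the value at an earlier timestamp, whereas pct_change(periods=...) compares against an earlier row. The two only agree when no dates are missing. The toy frame below (invented values, with 2022-01-03 deliberately left out) is a small sketch of the difference.

```python
import pandas as pd

# Hypothetical series with one calendar day (2022-01-03) missing
prices = pd.DataFrame(
    {'col_A': [100.0, 110.0, 121.0, 133.1]},
    index=pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-04', '2022-01-05']),
)

# Row-based: each row is compared with the row before it,
# even if that row is two calendar days earlier
by_rows = prices.pct_change(periods=1)

# Time-based: each row is compared with the value exactly one day earlier,
# which yields NaN when that day is not in the index
by_calendar = prices.pct_change(freq='1D')

print(by_rows.join(by_calendar, lsuffix='_rows', rsuffix='_calendar'))
```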
You can use shift to shift your index by a certain time period. For instance you can shift everything one day with:
df1.shift(freq="1D").add_suffix("_1D")
This will then be something like:
col_A_1D col_B_1D col_C_1D
2022-01-02 99330 12 122
2022-01-03 1123 1230 1287
2022-01-04 123 101 812739
2022-01-05 1143 1230123 252
2022-01-06 234 342 4546
You can then add the new columns to the existing data:
df1.merge(df1.shift(freq="1D").add_suffix("_1D"), how="left", left_index=True, right_index=True)
col_A col_B col_C col_A_1D col_B_1D col_C_1D
2022-01-01 99330 12 122 NaN NaN NaN
2022-01-02 1123 1230 1287 99330.0 12.0 122.0
2022-01-03 123 101 812739 1123.0 1230.0 1287.0
2022-01-04 1143 1230123 252 123.0 101.0 812739.0
2022-01-05 234 342 4546 1143.0 1230123.0 252.0
And then just calculate e.g. (df1["col_A"] - df1["col_A_1D"]) / df1["col_A_1D"]. This will then result in:
2022-01-01 NaN
2022-01-02 -0.988694
2022-01-03 -0.890472
2022-01-04 8.292683
2022-01-05 -0.795276
You can do this for all the required columns and time shifts in the same way. For instance:
initial_cols = ["col_A", "col_B", "col_C"]
shifted_cols = [f"{c}_1D" for c in initial_cols]
for i, s in zip(initial_cols, shifted_cols):
    df1[f"{i}_rolling"] = (df1[i] - df1[s]) / df1[s]
This will then result in:
col_A col_B col_C col_A_1D col_B_1D col_C_1D col_A_rolling col_B_rolling col_C_rolling
2022-01-01 99330 12 122 NaN NaN NaN NaN NaN NaN
2022-01-02 1123 1230 1287 99330.0 12.0 122.0 -0.988694 101.500000 9.549180
2022-01-03 123 101 812739 1123.0 1230.0 1287.0 -0.890472 -0.917886 630.498834
2022-01-04 1143 1230123 252 123.0 101.0 812739.0 8.292683 12178.435644 -0.999690
2022-01-05 234 342 4546 1143.0 1230123.0 252.0 -0.795276 -0.999722 17.039683
So to answer the main question:
Is there a way to write a loop so I can calculate the rolling returns on a daily ('1D'), weekly ('1W'), monthly ('1M') and six monthly ('6M') basis?
Yes, but there is also a way to do it without a loop :)
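The monthly and six-monthly cases are not shown above, so here is a rough sketch of how the shift approach might extend to them. It assumes calendar offsets built with pd.DateOffset (the alias '1M' denotes month end in pandas, which is probably not what is wanted here). The year-long prices frame and the names offsets, returns and result are invented for illustration. It still loops over the windows, but not over columns or rows.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data covering a full year; the value is the day number
idx = pd.date_range('2022-01-01', '2022-12-31', freq='D')
prices = pd.DataFrame({'col_A': np.arange(1.0, len(idx) + 1)}, index=idx)

offsets = {
    '1D': pd.DateOffset(days=1),
    '1W': pd.DateOffset(weeks=1),
    '1M': pd.DateOffset(months=1),
    '6M': pd.DateOffset(months=6),
}

returns = []
for label, offset in offsets.items():
    # Move every observation forward by the offset so that it sits on
    # the date it should be compared with
    previous = prices.shift(freq=offset)
    # Month offsets can map several days onto one date (Jan 29, 30 and 31
    # all land on Feb 28), so keep only the latest of them
    previous = previous[~previous.index.duplicated(keep='last')]
    previous = previous.reindex(prices.index)
    returns.append((prices / previous - 1).add_suffix(f'_rolling_{label}'))

result = pd.concat([prices, *returns], axis=1)
print(result.loc['2022-07-01'])
```

Dates with no observation exactly one offset earlier simply come out as NaN, which matches the behaviour of the daily example above.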

How to get values for the next month for a selected column from a pandas data frame with date time index

I have the data frame below (datetime index containing all working days in the US calendar):
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import random
us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
dt_rng = pd.date_range(start='1/1/2018', end='12/31/2018', freq=us_bd)
n1 = [round(random.uniform(20, 35),2) for _ in range(len(dt_rng))]
n2 = [random.randint(100, 200) for _ in range(len(dt_rng))]
df = pd.DataFrame(list(zip(n1,n2)), index=dt_rng, columns=['n1','n2'])
print(df)
n1 n2
2018-01-02 24.78 197
2018-01-03 23.33 176
2018-01-04 33.19 128
2018-01-05 32.49 110
... ... ...
2018-12-26 31.34 173
2018-12-27 29.72 166
2018-12-28 31.07 104
2018-12-31 33.52 184
[251 rows x 2 columns]
For each row in column n1, how can I get the value from the same column for the same day of the next month? If a value for that exact day is not available (because of weekends or holidays), it should take the value at the next available date. I tried df.n1.shift(21), but it does not work because the number of working days differs from month to month.
Expected output as below
n1 n2 next_mnth_val
2018-01-02 25.97 184 28.14
2018-01-03 24.94 133 27.65 # three values below are same, because on Feb 2018, the next working day after 2nd is 5th
2018-01-04 23.99 143 27.65
2018-01-05 24.69 182 27.65
2018-01-08 28.43 186 28.45
2018-01-09 31.47 104 23.14
... ... ... ...
2018-12-26 29.06 194 20.45
2018-12-27 29.63 158 20.45
2018-12-28 30.60 148 20.45
2018-12-31 20.45 121 20.45
For December, the next month value should be the last value of the data frame, i.e. the value at index 2018-12-31 (20.45).
Please help.
This is an interesting problem. I would shift the date by 1 month, then shift it again to the next business day:
df1 = df.copy().reset_index()
df1['new_date'] = df1['index'] + pd.DateOffset(months=1) + pd.offsets.BDay()
df.merge(df1, left_index=True, right_on='new_date')
Output (first 31 days):
n1_x n2_x index n1_y n2_y new_date
0 34.82 180 2018-01-02 29.83 129 2018-02-05
1 34.82 180 2018-01-03 24.28 166 2018-02-05
2 34.82 180 2018-01-04 27.88 110 2018-02-05
3 24.89 186 2018-01-05 25.34 111 2018-02-06
4 31.66 137 2018-01-08 26.28 138 2018-02-09
5 25.30 162 2018-01-09 32.71 139 2018-02-12
6 25.30 162 2018-01-10 34.39 159 2018-02-12
7 25.30 162 2018-01-11 20.89 132 2018-02-12
8 23.44 196 2018-01-12 29.27 167 2018-02-13
12 25.40 153 2018-01-19 28.52 185 2018-02-20
13 31.38 126 2018-01-22 23.49 141 2018-02-23
14 30.90 133 2018-01-23 25.56 145 2018-02-26
15 30.90 133 2018-01-24 23.06 155 2018-02-26
16 30.90 133 2018-01-25 24.95 174 2018-02-26
17 29.39 138 2018-01-26 21.28 157 2018-02-27
18 32.94 173 2018-01-29 20.26 189 2018-03-01
19 32.94 173 2018-01-30 22.41 196 2018-03-01
20 32.94 173 2018-01-31 27.32 149 2018-03-01
21 28.09 119 2018-02-01 31.39 192 2018-03-02
22 32.21 199 2018-02-02 28.22 151 2018-03-05
23 21.78 120 2018-02-05 34.82 180 2018-03-06
24 28.25 127 2018-02-06 24.89 186 2018-03-07
25 22.06 189 2018-02-07 32.85 125 2018-03-08
26 33.78 121 2018-02-08 30.12 102 2018-03-09
27 30.79 137 2018-02-09 31.66 137 2018-03-12
28 29.88 131 2018-02-12 25.30 162 2018-03-13
29 20.02 143 2018-02-13 23.44 196 2018-03-14
30 20.28 188 2018-02-14 20.04 102 2018-03-15
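Note that adding pd.offsets.BDay() always moves one further business day, even when the date one month ahead is already a working day (2018-01-02 maps to 2018-02-05 above, although 2018-02-02 is a Friday), and it does not know about the holiday calendar. As an alternative sketch, not the answerer's method, the lookup can be done against the index itself with searchsorted, which picks the first available date on or after the target. The sequential n1 values below are stand-ins for the random numbers in the question.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
dt_rng = pd.date_range(start='1/1/2018', end='12/31/2018', freq=us_bd)
# Deterministic stand-in values so the result is easy to check by eye
df = pd.DataFrame({'n1': np.arange(len(dt_rng), dtype=float)}, index=dt_rng)

# Same calendar day one month later for every row
target = df.index + pd.DateOffset(months=1)

# Position of the first index entry that is on or after each target date
pos = df.index.to_numpy().searchsorted(target.to_numpy(), side='left')
# Targets beyond the end of the data fall back to the last row
pos = np.minimum(pos, len(df) - 1)

df['next_mnth_val'] = df['n1'].to_numpy()[pos]
print(df.head(6))
print(df.tail(4))
```

With this version 2018-01-02 picks up the value from 2018-02-02, the rows for January 3 to 5 share the value from 2018-02-05, and all late-December rows receive the last value in the frame, as the question describes.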

How to join DataFrame with multiple conditions on different columns?

I have two data-frames as follows:
mydata1:
ID X1 X2 Date1
002 324 634 2016-01-01
002 334 534 2016-01-14
002 354 834 2016-01-30
004 543 843 2017-02-01
004 923 043 2017-04-15
005 032 212 2015-09-01
005 523 843 2017-09-15
005 212 222 2015-10-1
mydata2:
ID Y1 Y2 Date2
002 1224 234 2016-01-04
002 1254 249 2016-01-28
004 321 212 2016-12-01
005 1121 222 2017-09-13
I want to merge these two data frames on ID and date, keeping a match only where the difference between Date1 (in mydata1) and Date2 (in mydata2) is less than 15 days. My desired output looks like this:
ID X1 X2 Date1 Y1 Y2 Date2
002 324 634 2016-01-01 nan nan nan
002 334 534 2016-01-14 1224 234 2016-01-04
002 354 834 2016-01-30 1254 249 2016-01-28
004 543 843 2017-02-01 321 212 2015-12-01
004 923 043 2017-04-15 nan nan nan
005 032 212 2015-09-01 nan nan nan
005 523 843 2015-09-15 1121 222 2017-09-13
005 212 222 2015-10-1 nan nan nan
So your desired output is slightly wrong since one of the values is 2 years older than the joined value.
First we perform a join:
f = df.merge(df1, how='left', on='ID')
ID X1 X2 Date1 Y1 Y2 Date2
0 2 324 634 2016-01-01 1224 234 2016-01-04
1 2 334 534 2016-01-14 1224 234 2016-01-04
2 2 354 834 2016-01-30 1224 234 2016-01-04
3 4 543 843 2017-02-01 321 212 2016-12-01
4 4 923 43 2017-04-15 321 212 2016-12-01
5 5 32 212 2015-09-01 1121 222 2015-09-13
6 5 523 843 2015-09-15 1121 222 2015-09-13
7 5 212 222 2015-10-1 1121 222 2015-09-13
Then we create a boolean mask:
mask = (pd.to_datetime(f['Date1'], format='%Y-%m-%d') - pd.to_datetime(f['Date2'], format='%Y-%m-%d')).apply(lambda i: i.days <= 15 and i.days > 0)
0 False
1 True
2 False
3 False
4 False
5 False
6 True
7 False
Then we set it to nan where the condition does not match:
f.loc[~mask, ['Y1', 'Y2', 'Date2']] = np.nan
ID X1 X2 Date1 Y1 Y2 Date2
0 2 324 634 2016-01-01 NaN NaN NaN
1 2 334 534 2016-01-14 1224.0 234.0 2016-01-04
2 2 354 834 2016-01-30 NaN NaN NaN
3 4 543 843 2017-02-01 NaN NaN NaN
4 4 923 43 2017-04-15 NaN NaN NaN
5 5 32 212 2015-09-01 NaN NaN NaN
6 5 523 843 2015-09-15 1121.0 222.0 2015-09-13
7 5 212 222 2015-10-1 NaN NaN NaN
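A different technique that may fit this problem is pd.merge_asof, which matches each left row with the nearest earlier right row per ID and accepts a tolerance. This is a swapped-in alternative rather than the method above, and it assumes the same reading of the condition (Date2 strictly before Date1 and at most 15 days apart). The frames are rebuilt from the mydata1 and mydata2 tables in the question, with IDs kept as strings.

```python
import pandas as pd

mydata1 = pd.DataFrame({
    'ID': ['002', '002', '002', '004', '004', '005', '005', '005'],
    'X1': [324, 334, 354, 543, 923, 32, 523, 212],
    'X2': [634, 534, 834, 843, 43, 212, 843, 222],
    'Date1': pd.to_datetime(['2016-01-01', '2016-01-14', '2016-01-30', '2017-02-01',
                             '2017-04-15', '2015-09-01', '2017-09-15', '2015-10-01']),
})
mydata2 = pd.DataFrame({
    'ID': ['002', '002', '004', '005'],
    'Y1': [1224, 1254, 321, 1121],
    'Y2': [234, 249, 212, 222],
    'Date2': pd.to_datetime(['2016-01-04', '2016-01-28', '2016-12-01', '2017-09-13']),
})

# merge_asof requires both frames to be sorted by their date keys
result = pd.merge_asof(
    mydata1.sort_values('Date1'),
    mydata2.sort_values('Date2'),
    left_on='Date1',
    right_on='Date2',
    by='ID',                           # only match rows with the same ID
    direction='backward',              # look for the latest Date2 before Date1
    tolerance=pd.Timedelta(days=15),   # ignore matches further apart than this
    allow_exact_matches=False,         # require Date2 to be strictly earlier
)
print(result.sort_values(['ID', 'Date1']))
```

Unlike the plain merge on ID, this keeps exactly one row per row of mydata1 even when an ID has several candidate rows in mydata2.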

subsetting pandas dataframe on specific date value

I have a pandas dataframe like this
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to subset this dataframe on date == '2016-01-04'. The datatypes of the df dataframe are
df.dtypes
Out[1264]:
order_id object
buyer_id object
item_id object
time datetime64[ns]
This is what I am doing in python
df[df['time'] == '2016-01-04']
But it returns an empty dataframe. However, when I do
df[df['time'] < '2016-01-05'] it works. Please help.
The problem here is that the comparison is performed as an exact match: '2016-01-04' is parsed as midnight on that date, and since none of the times are '00:00:00', no rows match. You have to compare just the date components for this to work:
In [20]:
df[df['time'].dt.date == pd.to_datetime('2016-01-04').date()]
Out[20]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
IIUC you can use DatetimeIndex Partial String Indexing:
print(df)
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
df = df.set_index('time')
# use .loc here: in pandas 2.0+ plain df['2016-01-04'] no longer selects rows
print(df.loc['2016-01-04'])
order_id buyer_id item_id
time
2016-01-04 10:20:00 537 79 93
2016-01-04 10:30:00 540 191 93
2016-01-04 13:39:00 556 251 82
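For comparison, here is a short sketch that puts three ways of selecting a single day side by side. It uses the first five rows of the frame from the question; the variable names by_normalize, by_range and by_index are just labels for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': ['537', '540', '556', '589', '596'],
    'buyer_id': ['79', '191', '251', '191', '251'],
    'item_id': ['93', '93', '82', '104', '99'],
    'time': pd.to_datetime(['2016-01-04 10:20:00', '2016-01-04 10:30:00',
                            '2016-01-04 13:39:00', '2016-01-05 10:59:00',
                            '2016-01-05 13:48:00']),
})

day = pd.Timestamp('2016-01-04')

# 1. Drop the time part (set it to midnight) and compare with the day
by_normalize = df[df['time'].dt.normalize() == day]

# 2. Half-open range: from midnight of the day up to midnight of the next day
by_range = df[(df['time'] >= day) & (df['time'] < day + pd.Timedelta(days=1))]

# 3. Partial string indexing on a DatetimeIndex, via .loc
by_index = df.set_index('time').loc['2016-01-04']

print(by_normalize)
```

The first two keep the column as datetime64 throughout, whereas comparing against .dt.date goes through Python date objects, which tends to be slower on large frames.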
