python - compute pct_change() from the last value of the previous month

I am trying to compute returns from the last value of the previous month. Here is the sample dataframe (daily values); I can't figure out how to achieve this with the pct_change() function.
Sample df
date value
31/07/2020 141.793,00
03/08/2020 145.401,00
04/08/2020 124.534,00
05/08/2020 147.562,00
06/08/2020 131.043,00
07/08/2020 132.556,00
10/08/2020 140.874,00
11/08/2020 128.603,00
01/09/2020 131.451,00
02/09/2020 137.862,00
03/09/2020 130.439,00
04/09/2020 124.608,00
07/09/2020 133.674,00
08/09/2020 126.454,00
09/09/2020 136.488,00
Goal
I need the month-to-date cumulative return for each day, i.e. each day's return measured from the last value of the previous month. Something like this:
date value monthly
31/07/2020 141.793,00 NaN
03/08/2020 145.401,00 0,025445544
04/08/2020 124.534,00 -0,12171969
05/08/2020 147.562,00 0,040686071
06/08/2020 131.043,00 -0,075814744
07/08/2020 132.556,00 -0,06514426
10/08/2020 140.874,00 -0,006481279
11/08/2020 128.603,00 -0,093022928
01/09/2020 131.451,00 0,022145673
02/09/2020 137.862,00 0,071996765
03/09/2020 130.439,00 0,014276494
04/09/2020 124.608,00 -0,031064594
07/09/2020 133.674,00 0,039431429
08/09/2020 126.454,00 -0,016710341
09/09/2020 136.488,00 0,061312722

I believe you can get what you need with the following.
Use str.replace to strip the '.' thousands separators and turn the decimal ',' into a '.', then convert to float, apply pct_change(), and assign the result to a new column:
df['monthly'] = (df['value'].str.replace('.', '', regex=False)
                            .str.replace(',', '.', regex=False)
                            .astype(float)
                            .pct_change())
which prints:
          date       value   monthly
0   31/07/2020  141.793,00       NaN
1   03/08/2020  145.401,00  0.025446
2   04/08/2020  124.534,00 -0.143513
3   05/08/2020  147.562,00  0.184913
4   06/08/2020  131.043,00 -0.111946
5   07/08/2020  132.556,00  0.011546
6   10/08/2020  140.874,00  0.062751
7   11/08/2020  128.603,00 -0.087106
8   01/09/2020  131.451,00  0.022146
9   02/09/2020  137.862,00  0.048771
10  03/09/2020  130.439,00 -0.053844
11  04/09/2020  124.608,00 -0.044703
12  07/09/2020  133.674,00  0.072756
13  08/09/2020  126.454,00 -0.054012
14  09/09/2020  136.488,00  0.079349
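Note that a plain pct_change() is a day-over-day return, which is why the numbers above differ from the Goal column for all but the first day of each month. To anchor each day to the last value of the previous month instead, here is a minimal sketch (assuming dd/mm/yyyy dates and that consecutive months are present in the data):
import pandas as pd

df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# European number format: '.' is the thousands separator, ',' the decimal point
df['value'] = (df['value'].str.replace('.', '', regex=False)
                          .str.replace(',', '.', regex=False)
                          .astype(float))

month = df['date'].dt.to_period('M')
# last value observed in each month, shifted one month forward,
# so each row looks up the previous month's closing value
prev_close = month.map(df.groupby(month)['value'].last().shift())
df['monthly'] = df['value'] / prev_close - 1
July has no previous month in the sample, so its monthly value stays NaN, matching the expected output.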

Related

How can I get the first date on or after a given date?

I am using the following function. My index is a series of dates, and I am looking to get the first date of every month, or the first subsequent date if the first is not available. The code below finds the nearest date to the first of the month, but that causes a problem when, as in this case, 31st Dec is closer to 1st Jan than the correct answer, 4th Jan.
df['month'] = df.index.to_numpy().astype('datetime64[M]')

def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))

for n in range(len(df)):
    d = nearest(df.index, df['month'][n])
    print(d)
output:
2020-12-31 00:00:00
2020-12-31 00:00:00
2020-12-31 00:00:00
2020-12-31 00:00:00
2020-12-31 00:00:00
2021-02-01 00:00:00
2021-02-01 00:00:00
Is there an easy way to amend my code so that I get 2021-01-04 rather than 2020-12-31? Sample data:
Date x y z
28/12/2020 3723.030029 3735.360107 133.990005
29/12/2020 3750.01001 3727.040039 138.050003
30/12/2020 3736.189941 3732.040039 135.580002
31/12/2020 3733.27002 3756.070068 134.080002
04/01/2021 3764.610107 3700.649902 133.520004
05/01/2021 3698.02002 3726.860107 128.889999
06/01/2021 3712.199951 3748.139893 127.720001
07/01/2021 3764.709961 3803.790039 128.360001
08/01/2021 3815.050049 3824.679932 132.429993
11/01/2021 3803.139893 3799.610107 129.190002
12/01/2021 3801.620117 3801.189941 128.5
13/01/2021 3802.22998 3809.840088 128.759995
14/01/2021 3814.97998 3795.540039 130.800003
15/01/2021 3788.72998 3768.25 128.779999
19/01/2021 3781.879883 3798.909912 127.779999
20/01/2021 3816.219971 3851.850098 128.660004
21/01/2021 3857.459961 3853.070068 133.800003
22/01/2021 3844.23999 3841.469971 136.279999
25/01/2021 3851.679932 3855.360107 143.070007
26/01/2021 3862.959961 3849.620117 143.600006
27/01/2021 3836.830078 3750.77002 143.429993
28/01/2021 3755.75 3787.379883 139.520004
29/01/2021 3778.050049 3714.23999 135.830002
01/02/2021 3731.169922 3773.860107 133.75
02/02/2021 3791.840088 3826.310059 135.729996
03/02/2021 3840.27002 3830.169922 135.759995
04/02/2021 3836.659912 3871.73999 136.300003
05/02/2021 3878.300049 3886.830078 137.350006
08/02/2021 3892.590088 3915.590088 136.029999
09/02/2021 3910.48999 3911.22998 136.619995
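Since the index is sorted, the "first date on or after the 1st" is simply the earliest date within each calendar month, so no nearest-neighbour search is needed. A minimal sketch (assuming Date is the sorted DatetimeIndex, as in the code above):
# earliest available date in each calendar month
firsts = df.index.to_series().groupby(df.index.to_period('M')).min()
print(firsts)
# 2020-12 -> 2020-12-28, 2021-01 -> 2021-01-04, 2021-02 -> 2021-02-01
This returns 2021-01-04 for January because the minimum is taken only over dates that actually fall inside that month.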

Filter and group-by subscribers if start and end subscription date are within 14days

I have a newspaper dataset, and I consider subscribers who end and restart their subscription within 14 days to be one continuous subscriber. Therefore, I want to compare each end date to each start date within a CRM_relation_number. For matches within 14 days, I want a new column holding, as a list, the indices of the corresponding rows in the original dataframe.
input:
df = pd.DataFrame({
    'index': [1, 2, 3, 4, 5, 6],
    'CRM_relation_number': ["PRS-1000005", "PRS-1000005", "PRS-1000005",
                            "PRS-1000017", "PRS-1000017", "PRS-1000017"],
    'newspaper_name': ["stackoverflow_times", "the_stackoverflow_post", "the_stackoverflow_post",
                       "stackoverflow_times", "stackoverflow_times", "stackoverflow_times"],
    "Startdate": ["2016-12-07", "2020-07-04", "2019-09-28", "2019-01-04", "2018-01-02", "2016-09-17"],
    "Stopdate": ["2019-07-03", "2020-09-20", "2020-07-03", "2019-11-11", "2018-12-31", "2017-12-29"],
})
index CRM_relation_number newspaper_name Startdate Stopdate
1 PRS-1000005 stackoverflow_times 2016-12-07 2019-07-03
2 PRS-1000005 the_stackoverflow_post 2020-07-04 2020-09-20
3 PRS-1000005 the_stackoverflow_post 2019-09-28 2020-07-03
4 PRS-1000017 stackoverflow_times 2019-01-04 2019-11-11
5 PRS-1000017 stackoverflow_times 2018-01-02 2018-12-31
6 PRS-1000017 stackoverflow_times 2016-09-17 2017-12-29
expected output:
index CRM_relation_number newspaper_name Startdate Stopdate follow_up_subscription
3 PRS-1000005 the_stackoverflow_post 2019-09-28 2020-07-03 [2]
4 PRS-1000017 stackoverflow_times 2019-01-04 2019-11-11 [5, 6]
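A minimal sketch of one approach, assuming Startdate and Stopdate are parsed as datetimes: sort each CRM_relation_number by Startdate, start a new "chain" whenever the gap from the previous Stopdate exceeds 14 days, then collect the row indices per chain. Picking the representative row per chain (and which indices go into its list) is left to match the expected output:
import pandas as pd

df['Startdate'] = pd.to_datetime(df['Startdate'])
df['Stopdate'] = pd.to_datetime(df['Stopdate'])

def label_chains(g):
    g = g.sort_values('Startdate')
    # a new chain starts when the gap from the previous Stopdate exceeds 14 days
    gap = g['Startdate'] - g['Stopdate'].shift()
    g['chain'] = (gap > pd.Timedelta(days=14)).cumsum()
    return g

chains = df.groupby('CRM_relation_number', group_keys=False).apply(label_chains)
# all row indices belonging to one logical subscription
follow = chains.groupby(['CRM_relation_number', 'chain'])['index'].agg(list)
On this sample, PRS-1000005 yields chains [1] and [3, 2], and PRS-1000017 yields one chain [6, 5, 4], consistent with the groupings in the expected output.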

How to get all Sundays from the dates in pandas, extract the corresponding values, save them as a new dataframe, and do subtraction

I have a dataframe with 3 columns:
import glob
import pandas as pd

for i in glob.glob('InputFile.csv'):
    df = pd.read_csv(i)
    df['Date'] = pd.to_datetime(df['Date'])
print(df)
Date X Y
0 2020-02-13 00:11:59 -91.3900 -31.7914
1 2020-02-13 01:11:59 -87.1513 -34.6838
2 2020-02-13 02:11:59 -82.9126 -37.5762
3 2020-02-13 03:11:59 -79.3558 -40.2573
4 2020-02-13 04:11:59 -73.2293 -44.2463
... ... ... ...
2034 2020-05-04 18:00:00 -36.4645 -18.3421
2035 2020-05-04 19:00:00 -36.5767 -16.8311
2036 2020-05-04 20:00:00 -36.0170 -14.9356
2037 2020-05-04 21:00:00 -36.4354 -11.0533
2038 2020-05-04 22:00:00 -40.3424 -11.4000
[2039 rows x 3 columns]
print(df.dtypes)
Date    datetime64[ns]
X              float64
Y              float64
dtype: object
I would like the output to be:
Date X Y X_Diff Y_Diff
0 2020-02-16 00:11:59 -38.46270 -70.8352 -38.46270 -70.8352
1 2020-02-23 00:11:59 -80.70250 -7.1893 -42.23980 63.6459
2 2020-03-01 00:11:59 -47.38980 -39.2652 33.31270 -32.0759
3 2020-03-08 00:00:00 -35.65350 -64.5058 11.73630 -25.2406
4 2020-03-15 00:00:00 -43.03290 -15.8425 -7.37940 48.6633
5 2020-03-22 00:00:00 -19.77130 -25.5298 23.26160 -9.6873
6 2020-03-29 00:00:00 -13.18940 12.4093 6.58190 37.9391
7 2020-04-05 00:00:00 -8.49098 27.8407 4.69842 15.4314
8 2020-04-12 00:00:00 -19.05360 20.0445 -10.56262 -7.7962
9 2020-04-26 00:00:00 -25.61330 31.6306 -6.55970 11.5861
10 2020-05-03 00:00:00 -46.09250 -30.3557 -20.47920 -61.9863
In other words, I would like to search InputFile.csv for all dates that fall on a Sunday and extract the first occurrence of each Sunday (the first entry on that day, not the later times), along with the X and Y values of that row. Then I want to save this to a new dataframe where I can do subtraction on X and Y: copy the very first X and Y into columns X_Diff and Y_Diff respectively, then for each following row append the difference of its X minus the previous X to X_Diff, and likewise for Y, until the end of the file.
Here is my solution.
1. Preparation: I will need to generate some random data to work with.
import pandas as pd
import numpy as np
df = pd.date_range('2020-02-13', '2020-05-04', freq='1H').to_frame(name='Date').reset_index(drop=True)
df['X'] = np.random.randn(df.shape[0]) * 100
df['Y'] = np.random.randn(df.shape[0]) * 100
The data is like this:
Date X Y
0 2020-02-13 00:00:00 -12.044751 165.962038
1 2020-02-13 01:00:00 63.537406 65.137176
2 2020-02-13 02:00:00 67.555256 114.186898
... ... ... ..
2. Filter the dataframe to keep Sundays only. Then, generate another column with the date only, for grouping purposes.
df = df[df.Date.dt.dayofweek == 6]  # dayofweek: Monday=0 ... Sunday=6
df['date_only'] = df.Date.dt.date
Then, it looks like this.
                    Date           X           Y   date_only
72   2020-02-16 00:00:00   26.632391  120.311315  2020-02-16
73   2020-02-16 01:00:00  -14.111209   21.543440  2020-02-16
74   2020-02-16 02:00:00  -11.941086  -51.303122  2020-02-16
75   2020-02-16 03:00:00  -48.612563  137.023917  2020-02-16
76   2020-02-16 04:00:00  133.843010  -47.168805  2020-02-16
...                  ...         ...         ...         ...
1939 2020-05-03 19:00:00 -158.310600   30.149292  2020-05-03
1940 2020-05-03 20:00:00  170.212825  181.626611  2020-05-03
1941 2020-05-03 21:00:00   59.773796   11.262186  2020-05-03
1942 2020-05-03 22:00:00  -99.757428   83.529157  2020-05-03
1943 2020-05-03 23:00:00 -168.435315  245.884281  2020-05-03
3. Next step, sort the data frame by "Date". Then, group the dataframe by "date_only". After that, take the first row of each group.
df = df.sort_values(by=['Date'])
df = df.groupby('date_only').apply(lambda g: g.head(1)).reset_index(drop=True).drop(columns=['date_only'])
Results:
         Date           X           Y
0  2020-02-16    4.196690 -205.843619
1  2020-02-23 -189.811351   -5.294274
2  2020-03-01 -231.596763  -46.989246
3  2020-03-08   76.561269  -40.188202
4  2020-03-15  -18.653363   52.376442
5  2020-03-22  106.758484   22.969963
6  2020-03-29 -133.601545  185.561830
7  2020-04-05  -57.748555 -187.878427
8  2020-04-12   57.648834   10.365917
9  2020-04-19  -47.959093  177.455676
10 2020-04-26  -30.527067  -37.046330
11 2020-05-03  -52.854252 -136.069205
4. Last step, get the difference for each X/Y value with their previous value.
df['X_Diff'] = df.X.diff()
df['Y_Diff'] = df.Y.diff()
Results:
         Date           X           Y      X_Diff      Y_Diff
0  2020-02-16    4.196690 -205.843619         NaN         NaN
1  2020-02-23 -189.811351   -5.294274 -194.008042  200.549345
2  2020-03-01 -231.596763  -46.989246  -41.785412  -41.694972
3  2020-03-08   76.561269  -40.188202  308.158031    6.801044
4  2020-03-15  -18.653363   52.376442  -95.214632   92.564644
5  2020-03-22  106.758484   22.969963  125.411847  -29.406479
6  2020-03-29 -133.601545  185.561830 -240.360029  162.591867
7  2020-04-05  -57.748555 -187.878427   75.852990 -373.440257
8  2020-04-12   57.648834   10.365917  115.397389  198.244344
9  2020-04-19  -47.959093  177.455676 -105.607927  167.089758
10 2020-04-26  -30.527067  -37.046330   17.432026 -214.502006
11 2020-05-03  -52.854252 -136.069205  -22.327185  -99.022874
5. If you are not happy with the "NaN" for the first row, then just fill it with the X/Y columns' original values.
df['X_Diff'] = df['X_Diff'].fillna(df.X)
df['Y_Diff'] = df['Y_Diff'].fillna(df.Y)
Final results:
         Date           X           Y      X_Diff      Y_Diff
0  2020-02-16    4.196690 -205.843619    4.196690 -205.843619
1  2020-02-23 -189.811351   -5.294274 -194.008042  200.549345
2  2020-03-01 -231.596763  -46.989246  -41.785412  -41.694972
3  2020-03-08   76.561269  -40.188202  308.158031    6.801044
4  2020-03-15  -18.653363   52.376442  -95.214632   92.564644
5  2020-03-22  106.758484   22.969963  125.411847  -29.406479
6  2020-03-29 -133.601545  185.561830 -240.360029  162.591867
7  2020-04-05  -57.748555 -187.878427   75.852990 -373.440257
8  2020-04-12   57.648834   10.365917  115.397389  198.244344
9  2020-04-19  -47.959093  177.455676 -105.607927  167.089758
10 2020-04-26  -30.527067  -37.046330   17.432026 -214.502006
11 2020-05-03  -52.854252 -136.069205  -22.327185  -99.022874
Note: no time is displayed in the "Date" field of the final result. This is because the generated data is hourly, so the first row of each Sunday is XXXX-XX-XX 00:00:00, and pandas hides the 00:00:00 component when the time is midnight for every row, although the timestamps still carry it.
Here is the Colab Link. You can have all my code in a notebook here.
https://colab.research.google.com/drive/1ecSSvJW0waCU19KPoj5uiiYmHp9SSQOf?usp=sharing
I will create a dataframe as Christopher did:
import pandas as pd
import numpy as np
df = pd.date_range('2020-02-13', '2020-05-04', freq='1H').to_frame(name='Date').reset_index(drop=True)
df['X'] = np.random.randn(df.shape[0]) * 100
df['Y'] = np.random.randn(df.shape[0]) * 100
Dataframe view (image omitted).
First, set the datetime column as the index:
df = df.set_index('Date')
Second, keep only the Sunday rows:
sunday_df = df[df.index.dayofweek == 6]
Third, resample to daily frequency, take the first value of each day (the question asks for the first entry of each Sunday), and drop the empty non-Sunday dates:
sunday_df = sunday_df.resample('D').first().dropna()
Lastly, do the subtraction:
sunday_df['X_Diff'] = sunday_df.X.diff()
sunday_df['Y_Diff'] = sunday_df.Y.diff()
The final view of the new dataframe (image omitted).

Plot each column mean grouped by specific date range

I have 7 columns of data indexed by datetime (30-minute frequency), starting 2017-05-31 and ending 2018-05-25. I want to plot the mean over specific date ranges (seasons). I have been trying groupby, but I can't group by a specific range: df.groupby(df.date.dt.month).mean() gives wrong results.
A few lines from the dataset (date range is from 2017-05-31 to 2018-05-25)
50 51 56 58
date
2017-05-31 00:00:00 200.213542 276.929198 242.879051 NaN
2017-05-31 00:30:00 200.215478 276.928229 242.879051 NaN
2017-05-31 01:00:00 200.215478 276.925324 242.878083 NaN
2017-06-01 01:00:00 200.221288 276.944691 242.827729 NaN
2017-06-01 01:30:00 200.221288 276.944691 242.827729 NaN
2017-08-31 09:00:00 206.961886 283.374453 245.041349 184.358250
2017-08-31 09:30:00 206.966727 283.377358 245.042317 184.360187
2017-12-31 09:00:00 212.925877 287.198416 247.455413 187.175144
2017-12-31 09:30:00 212.926846 287.196480 247.465097 187.179987
2018-03-31 23:00:00 213.304498 286.933093 246.469647 186.887548
2018-03-31 23:30:00 213.308369 286.938902 246.468678 186.891422
2018-04-30 23:00:00 215.496812 288.342024 247.522230 188.104749
2018-04-30 23:30:00 215.497781 288.340086 247.520294 188.103780
I have created these variables (These are the ranges I need)
increment_rates_winter = df['2017-08-30'].mean() - df['2017-06-01'].mean()
increment_rates_spring = df['2017-11-30'].mean() - df['2017-09-01'].mean()
increment_rates_summer = df['2018-02-28'].mean() - df['2017-12-01'].mean()
increment_rates_fall = df['2018-05-24'].mean() - df['2018-03-01'].mean()
Concatenated them:
df_seasons =pd.concat([increment_rates_winter,increment_rates_spring,increment_rates_summer,increment_rates_fall],axis=1)
and after plotting, I got the wrong result (plot omitted). What I've been trying to get is this:
df_seasons
Out[664]:
Winter Spring Summer Fall
50 6.697123 6.948447 -1.961549 7.662622
51 6.428329 4.760650 -2.188402 5.927087
52 5.580953 6.667529 1.136889 12.939295
53 6.406259 2.506279 -2.105125 6.964549
54 4.332826 3.678492 -2.574769 6.569398
56 2.222032 3.359607 -2.694863 5.348258
58 NaN 1.388535 -0.035889 4.213046
That is, the seasons on the x-axis and the means plotted for each column. The date ranges I need are:
Winter = df['2017-06-01':'2017-08-30']
Spring = df['2017-09-01':'2017-11-30']
Summer = df['2017-12-01':'2018-02-28']
Fall = df['2018-03-01':'2018-05-30']
Thank you in advance!
We can select a specific date range in the following way; you can then define the ranges however you want and take the mean:
import pandas as pd
df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
start_date = "2017-12-31 09:00:00"
end_date = "2018-04-30 23:00:00"
mask = (df['date'] > start_date) & (df['date'] <= end_date)
f_df = df.loc[mask]
This gives the output
                  date          50  ...          58
8  2017-12-31 09:30:00  212.926846  ...  187.179987
9  2018-03-31 23:00:00  213.304498  ...  186.887548
10 2018-03-31 23:30:00  213.308369  ...  186.891422
11 2018-04-30 23:00:00  215.496812  ...  188.104749
Hope this helps
How about transposing it:
df_seasons.T.plot()
Output: (plot image omitted)
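For completeness, a minimal sketch combining the question's range slices with the transpose idea (assuming df has a DatetimeIndex; whether you want the mean of each slice, as here, or the increment between endpoint means as in the question, the pattern is the same):
import pandas as pd

seasons = {
    'Winter': ('2017-06-01', '2017-08-30'),
    'Spring': ('2017-09-01', '2017-11-30'),
    'Summer': ('2017-12-01', '2018-02-28'),
    'Fall':   ('2018-03-01', '2018-05-30'),
}
# column-wise mean of each seasonal slice, one column per season
df_seasons = pd.concat(
    {name: df.loc[start:end].mean() for name, (start, end) in seasons.items()},
    axis=1,
)
df_seasons.T.plot()  # seasons on the x-axis, one line per data column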

Pandas DataFrame.resample monthly offset from particular day of month

I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day.)
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame, even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame (ideally pegged to each particular row, as I run some lambda on the associated historical resampling from each row's date)?
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
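On the first problem, one hedged note: newer pandas versions expose an origin argument on .resample() (added in pandas 1.1, with origin='end' supported from 1.3), which pegs the bins to the end of the frame rather than its start:
# peg the 30-day bins to the end of the truncated frame instead of its start
dfm = df.loc[:'2018-02-22'].resample('30D', origin='end').mean()
This still does not cure the drift, though, since 30 days is not a calendar month.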
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-01-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import *
df['MonthPrior'] = df.index.to_pydatetime() + DateOffset(months=-1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need an efficient way to iterate this, so that for each row in df I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values, going back one calendar month at a time from that row's DatetimeIndex.
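For the first step (PreviousMonthMean), a minimal sketch of the per-row window mean; it is a plain loop over rows, which is fine for modestly sized frames and reproduces the table above on the tl;dr data:
# mean of x over the half-open window [MonthPrior, index) for every row
df['PreviousMonthMean'] = [
    df.loc[(df.index >= prior) & (df.index < ts), 'x'].mean()
    for ts, prior in zip(df.index, df['MonthPrior'])
]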
