How to average forward looking data in a pandas dataframe - python

I have a df with values: eg.
jpy3m jpy1w
timestamp
2019-01-09 00:00:00 -45 -25
2019-01-08 00:00:00 -48 -31
2019-01-07 00:00:00 -51 -27
2019-01-04 00:00:00 -46 -25
...
2016-01-06 00:00:00 -61 -26
2016-01-05 00:00:00 -62 -22
2016-01-04 00:00:00 -57 -21
The index is from today to the start of 2016. Business days only.
What I wish to process, but am unable to, is, for each day where it is possible to do so, take the value in jpy3m and take away the average of (the value of jpy1w on the same day, but also, the value of jpy1w over the next 11 weeks)
e.g. consider 2016-01-04
take value of jpy3m = -57
take average of jpy 1w on (2016-01-04,2016-01-11(1w later),2016-01-19(2w later (18th is not a good business day),2016-01-25(3w later)....etc, until 2016-03-25(11w later))
lets say this averages -25
then take -25 - (-57) = +32, so +32 is the value for 4th jan
This will go in a new column, df['result']
And repeat for the 5th jan 16, and so forth.
I understand the most recent 3 months wont have a result and will be np.nan
is this possible?
thank you

I am assuming that if the day is not a good business day then that record is not in your dataset. If it is in your dataset then you can remove those records.
Sort records in descending order of time.
we are averaging current value with next 7th, 14th, ...., 77th values.
avgs = df['jpy1w'].copy()
for i in range(11):
avgs = df['jpy1w'].shift(7*(i+1)) + avgs
avgs = avgs/12
df['result'] = df['jpy3m'] + avgs

Related

Matching strftime weeks with pandas time series week frequency

I have a dataframe with a '%Y/%U' date column:
Value Count YW Date
0 2 2017/19 2017-05-13
1 2 2017/20 2017-05-19
2 24 2017/22 2017-06-03
3 35 2017/23 2017-06-10
4 41 2017/24 2017-06-17
.. ... ... ...
126 51 2020/05 2020-02-06
127 26 2020/06 2020-02-15
128 30 2020/07 2020-02-22
129 26 2020/08 2020-02-29
130 18 2020/09 2020-03-04
I'm trying to add the missing weeks, like 2017/21 with 0 Count values, so I created this index:
idx = pdh.pd.date_range(df['Date'].min(), df['Date'].max(), freq='W').floor('d')
Which yields:
DatetimeIndex(['2017-05-14', '2017-05-21', '2017-05-28', '2017-06-04',
'2017-06-11', '2017-06-18', '2017-06-25', '2017-07-02',
'2017-07-09', '2017-07-16',
...
'2019-12-29', '2020-01-05', '2020-01-12', '2020-01-19',
'2020-01-26', '2020-02-02', '2020-02-09', '2020-02-16',
'2020-02-23', '2020-03-01'],
dtype='datetime64[ns]', length=147, freq=None)
Almost there, converting to '%Y/%U' again:
idx = idx.strftime('%Y/%U')
But this yields:
Index(['2017/20', '2017/21', '2017/22', '2017/23', '2017/24', '2017/25',
'2017/26', '2017/27', '2017/28', '2017/29',
...
'2019/52', '2020/01', '2020/02', '2020/03', '2020/04', '2020/05',
'2020/06', '2020/07', '2020/08', '2020/09'],
dtype='object', length=147)
I'm not sure yet whether it is a problem with reindexing but I've noticed that the firts year/week pair is now 2017/20 instead of 2017/19. This is because the freq='W' offset converts every date to the correspondent week starting day as the default is the same as 'W-SUN' anchored offset. Indeed, 2017-05-14 is a Sunday.
The problem is that the converted date now returns the next week number because of this, 2017-05-13 was converted to 2017-05-14. Using the %U strftime code does start the weeks on Sunday as well, however it is counted from the previous Sunday. Using 'W-SAT' (as 2017-05-13 was a Saturday) solves it at the start but the end will be wrong this case.
Is there any dynamic solution so date_range would start and end with the proper weeks?

Resample dataframe based on time ranges, ignoring date

I am trying to resample my data to get sums. This resampling needs to be based solely on time. I want to group the times in 6 hours, so regardless of the date I will get 4 sums.
My df looks like this:
booking_count
date_time
2013-04-04 08:32:25 58
2013-04-04 18:43:11 1
2013-30-04 12:39:15 52
2013-14-05 06:51:33 99
2013-01-06 23:59:17 1
2013-03-06 19:37:25 42
2013-27-06 04:12:01 38
With this example data, I expect the get the following results:
00:00:00 38
06:00:00 157
12:00:00 52
18:00:00 43
To get around the date issue, I tried to keep only the time values:
df['time'] = pd.DatetimeIndex(df['date_time']).time
new_df = df[['time', 'booking_bool']].set_index('time').resample('360min').sum()
Unfortunately, this was to no avail. How do I go about getting my required results? Is resample() even suitable for this task?
I don't think resample() is a good method to do this because you need to groupby based on hours independently of the day. Maybe you can try using cut using a custom bins parameter, and then a usual groupby
bins = np.arange(start=0, stop=24+6, step=6)
group = df.groupby(pd.cut(
df.index.hour,
bins, right=False,
labels=pd.date_range('00:00:00', '18:00:00', freq='6H').time)
).sum()
group
# booking_count
# 00:00:00 38
# 06:00:00 157
# 12:00:00 52
# 18:00:00 44

Pandas Dataframe resample week, starting first day of the year

I have a dataframe containing hourly data, i want to get the max for each week of the year, so i used resample to group data by week
weeks = data.resample("W").max()
the problem is that week max is calculated starting the first monday of the year, while i want it to be calculated starting the first day of the year.
I obtain the following result, where you can notice that there is 53 weeks, and the last week is calculated on the next year while 2017 doesn't exist in the data
Date dots
2016-01-03 0.647786
2016-01-10 0.917071
2016-01-17 0.667857
2016-01-24 0.669286
2016-01-31 0.645357
Date dots
2016-12-04 0.646786
2016-12-11 0.857714
2016-12-18 0.670000
2016-12-25 0.674571
2017-01-01 0.654571
is there a way to calculate week for pandas dataframe starting first day of the year?
Find the starting day of the year, for example let say it's Friday, and then you can specify an anchoring suffix to resample in order to calculate week starting first day of the year:
weeks = data.resample("W-FRI").max()
One quick remedy is, given you data in one year, you can group it by day first, then take group of 7 days:
new_df = (df.resample("D", on='Date').dots
.max().reset_index()
)
new_df.groupby(new_df.index//7).agg({'Date': 'min', 'dots': 'max'})
new_df.head()
Output:
Date dots
0 2016-01-01 0.996387
1 2016-01-08 0.999775
2 2016-01-15 0.997612
3 2016-01-22 0.979376
4 2016-01-29 0.998240
5 2016-02-05 0.995030
6 2016-02-12 0.987500
and tail:
Date dots
48 2016-12-02 0.999910
49 2016-12-09 0.992910
50 2016-12-16 0.996877
51 2016-12-23 0.992986
52 2016-12-30 0.960348

Pandas time series resample, binning seems off

I was answering another question here with something about pandas I thought to know, time series resampling, when I noticed this odd binning.
Let's say I have a dataframe with a daily date range index and a column I want to resample and sum on.
index = pd.date_range(start="1/1/2018", end="31/12/2018")
df = pd.DataFrame(np.random.randint(100, size=len(index)),
columns=["sales"], index=index)
>>> df.head()
sales
2018-01-01 66
2018-01-02 18
2018-01-03 45
2018-01-04 92
2018-01-05 76
Now I resample by one month, everything looks fine:
>>>df.resample("1M").sum()
sales
2018-01-31 1507
2018-02-28 1186
2018-03-31 1382
[...]
2018-11-30 1342
2018-12-31 1337
If I try to resample by more months though binning starts to look off. This is particularly evident with 6M
df.resample("6M").sum()
sales
2018-01-31 1507
2018-07-31 8393
2019-01-31 7283
First bin spans just over one month, last bin goes one month to the future. Maybe I have to set closed="left" to get the proper limits:
df.resample("6M", closed="left").sum()
sales
2018-06-30 8090
2018-12-31 9054
2019-06-30 39
Now I have an extra bin in 2019 with data from 2018-12-31...
Is this working properly? am I missing any option I should set?
EDIT: here's the output I would expect resampling one year in six month intervals, first interval spanning from Jan 1st to Jun 30, second interval spanning from Jul 1st to Dec 31.
df.resample("6M", closed="left").sum()
sales
2018-06-30 8090
2018-12-31 9093 # 9054 + 39
Note that there's also some doubt here about what it's happening with June 30 data, does it go in the first bin like I would expect or the second? I mean with the last bin it's evident but the same is probably happening in all the bins.
The M time offset alias implies month end frequency.
What you need is 6MS which is an alias for month start frequency:
df.resample('6MS').sum()
resulting in
sales
2018-01-01 8130
2018-07-01 9563
2019-01-01 0
Also df.groupby(pd.Grouper(freq='6MS')).sum() can be used interchangeably.
For extra clarity you can compare ranges directly:
>>> pd.date_range('2018-01-01', '2018-12-31', freq='6M')
DatetimeIndex(['2018-01-31', '2018-07-31'], dtype='datetime64[ns]', freq='6M')
>>> pd.date_range('2018-01-01', '2018-12-31', freq='6MS')
DatetimeIndex(['2018-01-01', '2018-07-01'], dtype='datetime64[ns]', freq='6MS')
Adding np.random.seed(365) to check both our outputs.
print(df.resample("6M", kind='period').sum())
sales
2018-01 8794
2018-07 9033
would this work for you?

Pandas DataFrame.resample monthly offset from particular day of month

I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day.)
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame – even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row as I run some lambda on the associated historical resampling from each row's date?)
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-02-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import *
df['MonthPrior'] = df.index.to_pydatetime() + DateOffset(months=-1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need to find an efficient way to iterate that so that for each row in df I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DateTimeIndex....

Categories

Resources