I was answering another question here about something in pandas I thought I knew, time series resampling, when I noticed this odd binning.
Let's say I have a dataframe with a daily date range index and a column I want to resample and sum on.
import numpy as np
import pandas as pd

index = pd.date_range(start="1/1/2018", end="31/12/2018")
df = pd.DataFrame(np.random.randint(100, size=len(index)),
                  columns=["sales"], index=index)
>>> df.head()
sales
2018-01-01 66
2018-01-02 18
2018-01-03 45
2018-01-04 92
2018-01-05 76
Now if I resample by one month, everything looks fine:
>>> df.resample("1M").sum()
sales
2018-01-31 1507
2018-02-28 1186
2018-03-31 1382
[...]
2018-11-30 1342
2018-12-31 1337
If I try to resample over multiple months, though, the binning starts to look off. This is particularly evident with 6M:
df.resample("6M").sum()
sales
2018-01-31 1507
2018-07-31 8393
2019-01-31 7283
The first bin spans just over one month, and the last bin extends one month into the future. Maybe I have to set closed="left" to get the proper limits:
df.resample("6M", closed="left").sum()
sales
2018-06-30 8090
2018-12-31 9054
2019-06-30 39
Now I have an extra bin in 2019 with data from 2018-12-31...
Is this working properly? Am I missing an option I should set?
EDIT: here's the output I would expect when resampling one year in six-month intervals, the first interval spanning Jan 1st to Jun 30th and the second spanning Jul 1st to Dec 31st:
df.resample("6M", closed="left").sum()
sales
2018-06-30 8090
2018-12-31 9093 # 9054 + 39
Note that there's also some doubt here about what's happening with the June 30 data: does it go in the first bin, as I would expect, or in the second? With the last bin it's evident, but the same thing is probably happening in all the bins.
The M time offset alias implies month-end frequency.
What you need is 6MS, which is the alias for month-start frequency:
df.resample('6MS').sum()
resulting in
sales
2018-01-01 8130
2018-07-01 9563
2019-01-01 0
Also df.groupby(pd.Grouper(freq='6MS')).sum() can be used interchangeably.
For extra clarity you can compare ranges directly:
>>> pd.date_range('2018-01-01', '2018-12-31', freq='6M')
DatetimeIndex(['2018-01-31', '2018-07-31'], dtype='datetime64[ns]', freq='6M')
>>> pd.date_range('2018-01-01', '2018-12-31', freq='6MS')
DatetimeIndex(['2018-01-01', '2018-07-01'], dtype='datetime64[ns]', freq='6MS')
Adding np.random.seed(365) before creating the dataframe so we can compare both our outputs:
print(df.resample("6M", kind='period').sum())
sales
2018-01 8794
2018-07 9033
Resampling with kind='period' labels each bin by the period it covers (2018-01 stands for January through June, 2018-07 for July through December), which sidesteps the confusing month-end labels. Would this work for you?
Related
I have a dataframe with a '%Y/%U' date column:
Value Count YW Date
0 2 2017/19 2017-05-13
1 2 2017/20 2017-05-19
2 24 2017/22 2017-06-03
3 35 2017/23 2017-06-10
4 41 2017/24 2017-06-17
.. ... ... ...
126 51 2020/05 2020-02-06
127 26 2020/06 2020-02-15
128 30 2020/07 2020-02-22
129 26 2020/08 2020-02-29
130 18 2020/09 2020-03-04
I'm trying to add the missing weeks, like 2017/21 with 0 Count values, so I created this index:
idx = pd.date_range(df['Date'].min(), df['Date'].max(), freq='W').floor('d')
Which yields:
DatetimeIndex(['2017-05-14', '2017-05-21', '2017-05-28', '2017-06-04',
'2017-06-11', '2017-06-18', '2017-06-25', '2017-07-02',
'2017-07-09', '2017-07-16',
...
'2019-12-29', '2020-01-05', '2020-01-12', '2020-01-19',
'2020-01-26', '2020-02-02', '2020-02-09', '2020-02-16',
'2020-02-23', '2020-03-01'],
dtype='datetime64[ns]', length=147, freq=None)
Almost there, converting to '%Y/%U' again:
idx = idx.strftime('%Y/%U')
But this yields:
Index(['2017/20', '2017/21', '2017/22', '2017/23', '2017/24', '2017/25',
'2017/26', '2017/27', '2017/28', '2017/29',
...
'2019/52', '2020/01', '2020/02', '2020/03', '2020/04', '2020/05',
'2020/06', '2020/07', '2020/08', '2020/09'],
dtype='object', length=147)
I'm not sure yet whether it's a problem with reindexing, but I've noticed that the first year/week pair is now 2017/20 instead of 2017/19. This is because the freq='W' offset rolls every date forward to the corresponding week boundary, since the default is the same as the 'W-SUN' anchored offset. Indeed, 2017-05-14 is a Sunday.
The problem is that the converted date now falls in the next week number because of this: 2017-05-13 was converted to 2017-05-14. The %U strftime code does start weeks on Sunday as well, but it counts from the previous Sunday. Using 'W-SAT' (as 2017-05-13 was a Saturday) fixes the start, but then the end will be wrong in this case.
Is there any dynamic solution so date_range would start and end with the proper weeks?
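A sketch of one dynamic approach (not from the thread, so treat it as an assumption): snap the start date back to the Sunday that begins its %U week, then enumerate Sundays, so every generated date is the first day of its %U week and both ends land in the right weeks:

import pandas as pd

start, end = df['Date'].min(), df['Date'].max()
# Roll start back to the Sunday beginning its %U week
# (Timestamp.dayofweek: Monday=0 ... Sunday=6).
week_start = start - pd.Timedelta(days=(start.dayofweek + 1) % 7)
# Every generated Sunday is the first day of a %U week.
idx = pd.date_range(week_start, end, freq='W-SUN').strftime('%Y/%U')

For the sample data this starts at 2017-05-07, which strftime('%Y/%U') renders as 2017/19, matching the first row.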
I have a dataframe containing hourly data, and I want to get the max for each week of the year, so I used resample to group the data by week:
weeks = data.resample("W").max()
The problem is that the week max is calculated starting from the first Monday of the year, while I want it calculated starting from the first day of the year.
I obtain the following result, where you can see that there are 53 weeks and that the last week spills into the next year, even though 2017 doesn't exist in the data:
Date dots
2016-01-03 0.647786
2016-01-10 0.917071
2016-01-17 0.667857
2016-01-24 0.669286
2016-01-31 0.645357
...
2016-12-04 0.646786
2016-12-11 0.857714
2016-12-18 0.670000
2016-12-25 0.674571
2017-01-01 0.654571
Is there a way to calculate weeks for a pandas dataframe starting from the first day of the year?
Find the starting day of the year. For example, let's say it's a Friday; then you can specify an anchoring suffix to resample in order to calculate weeks starting from the first day of the year:
weeks = data.resample("W-FRI").max()
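If the starting weekday varies between datasets, the anchor can be derived dynamically; a small sketch, assuming data has a DatetimeIndex:

# day_name() returns e.g. 'Friday' for 2016-01-01, giving anchor 'FRI'.
anchor = data.index.min().day_name()[:3].upper()
weeks = data.resample("W-" + anchor).max()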
One quick remedy: given your data lies within one year, you can group it by day first, then take groups of 7 days:
new_df = (df.resample("D", on='Date').dots
            .max().reset_index())
new_df = new_df.groupby(new_df.index // 7).agg({'Date': 'min', 'dots': 'max'})
new_df.head(7)
Output:
Date dots
0 2016-01-01 0.996387
1 2016-01-08 0.999775
2 2016-01-15 0.997612
3 2016-01-22 0.979376
4 2016-01-29 0.998240
5 2016-02-05 0.995030
6 2016-02-12 0.987500
and tail:
Date dots
48 2016-12-02 0.999910
49 2016-12-09 0.992910
50 2016-12-16 0.996877
51 2016-12-23 0.992986
52 2016-12-30 0.960348
I have a DataFrame df with sporadic business-day rows (i.e., there is not always a row for every business day).
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame – even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row as I run some lambda on the associated historical resampling from each row's date? A newer-pandas option is sketched after this list.)
It will drift over time, because not every month is 30 days long. So as I go back in time, I will find that the interval 12 "months" prior ends on 2017-02-27, not 2017-02-22 as I want.
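On the pegging question above: recent pandas (1.3+) added origin='end' to resample; a minimal sketch, assuming that version is available:

# Bins are laid out backwards from the last timestamp, so the final
# 30-day bin ends exactly at the end of the data.
dfm = df.resample('30D', origin='end').mean()

This addresses the anchoring, though not the 30-day-vs-calendar-month drift.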
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-01-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import DateOffset
df['MonthPrior'] = df.index - DateOffset(months=1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
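For reference, a minimal sketch that fills PreviousMonthMean with a per-row scan (O(n²), but it reproduces the table above):

# Average x over each row's half-open window [MonthPrior, index);
# an empty window yields NaN, as in the first row.
df['PreviousMonthMean'] = [
    df.loc[(df.index >= prior) & (df.index < ts), 'x'].mean()
    for ts, prior in zip(df.index, df['MonthPrior'])
]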
If we can get that far, then I need to find an efficient way to iterate it so that, for each row in df, I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DatetimeIndex...
I have a dataframe with full year data of values on each second:
YYYY-MO-DD HH-MI-SS_SSS TEMPERATURE (C)
2016-09-30 23:59:55.923 28.63
2016-09-30 23:59:56.924 28.61
2016-09-30 23:59:57.923 28.63
... ...
2017-05-30 23:59:57.923 30.02
I want to create a new dataframe which takes each week or month of values and averages them over the same hour of each day (a kind of moving average, but per hour).
So the result for the month case would look like this:
Date TEMPERATURE (C)
2016-09 00:00:00 28.63
2016-09 01:00:00 27.53
2016-09 02:00:00 27.44
...
2016-10 00:00:00 28.61
... ...
I'm aware that I can split the df into 12 dfs, one per month, and use:
hour = pd.to_timedelta(df['YYYY-MO-DD HH-MI-SS_SSS'].dt.hour, unit='H')
df2 = df.groupby(hour).mean()
But I'm searching for a better and faster way.
Thanks!!
Here's an alternate method of converting your date and time columns:
df['datetime'] = pd.to_datetime(df['YYYY-MO-DD'] + ' ' + df['HH-MI-SS_SSS'])
Additionally you could groupby both week and hour to form a MultiIndex dataframe (instead of creating and managing 12 dfs):
df.groupby([df.datetime.dt.weekofyear, df.datetime.dt.hour]).mean()
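For the month case from the question, the same idea applies with a month period in place of the week number; a sketch:

# Mean temperature per (month, hour-of-day) pair.
df.groupby([df.datetime.dt.to_period('M'), df.datetime.dt.hour]).mean()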
I have a time series at weekly frequency spanning 1204 days.
I want to resample it on a 365D basis (by summing), but the series runs across 3.29 × 365D, not a multiple of 365D.
By default, resample returns 4 rows.
Here is the raw data:
DATE
2012-08-12 15350.0
2012-08-19 11204.0
2012-08-26 11795.0
2012-09-02 15160.0
2012-09-09 9991.0
2012-09-16 12337.0
2012-09-23 10721.0
2012-09-30 9952.0
2012-10-07 11903.0
2012-10-14 8537.0
...
2015-09-27 14234.0
2015-10-04 17917.0
2015-10-11 13610.0
2015-10-18 8716.0
2015-10-25 15191.0
2015-11-01 8925.0
2015-11-08 13306.0
2015-11-15 8884.0
2015-11-22 11527.0
2015-11-29 6859.0
df.index.max() - df.index.min()
Timedelta('1204 days 00:00:00')
If I apply:
df.resample('365D').sum()
I got:
DATE
2012-08-12 536310.0
2013-08-12 555016.0
2014-08-12 569548.0
2015-08-12 245942.0
Freq: 365D, dtype: float64
It seems like the last bin is the one covering less than 365 days.
How do I force resample to exclude it from the result?
df.resample('365D') starts sampling at the lowest day in the index, so the last bin will almost always not cover a full 365 days. Just skip it:
df.resample('365D').sum()[:-1]
You can also consider resampling by calendar year:
df.resample('A').sum()
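If you'd rather make the "drop the incomplete bin" rule explicit instead of always slicing off the last row, a sketch that keeps only bins whose full 365-day window fits inside the data:

# A bin starting at t covers [t, t + 365 days); keep it only when the
# whole window ends on or before the last observed date.
res = df.resample('365D').sum()
res = res[res.index + pd.Timedelta('364D') <= df.index.max()]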