I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day.)
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame – even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row as I run some lambda on the associated historical resampling from each row's date?)
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-02-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import *
df['MonthPrior'] = df.index.to_pydatetime() + DateOffset(months=-1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need to find an efficient way to iterate that so that for each row in df I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DateTimeIndex....
Related
Rookie here so please excuse my question format:
I got an event time series dataset for two months (columns for "date/time" and "# of events", each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with column col as well as a datetime column.
You can simply sort the column with
import pandas as pd
df = pd.DataFrame({'col' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
'datetime' : ['2019-01-01 00:00:00','2015-02-01 00:00:00','2015-03-01 00:00:00','2015-04-01 00:00:00',
'2018-05-01 00:00:00','2016-06-01 00:00:00','2017-07-01 00:00:00','2013-08-01 00:00:00',
'2015-09-01 00:00:00','2015-10-01 00:00:00','2015-11-01 00:00:00','2015-12-01 00:00:00',
'2014-01-01 00:00:00','2020-01-01 00:00:00','2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10,:]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest. I guess there should be an nsmallest counterpart. pandas.DataFrame.nsmallest
df.nsmallest(n=10, columns=['col'])
My bad, so your DateTimeIndex is a Hourly sampling. And you need the hour(s) with least events weekly.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well I'd start by converting each hour into columns.
1. Create an Hour column that holds the hour of the day.
df['hour'] = df['date'].hour
Pivot the hour values into columns having values as n_events.
So you'll then have 1 datetime index, 24 hour columns, with values denoting #events. pandas.DataFrame.pivot_table
...
Date hour0 ... hour8 hour9 hour10 ... hour24
2020-06-06 0 3 3 2 0
...
Then you can resample it to weekly level aggregate using sum.
df.resample('w').sum()
The last part is a bit tricky to do on the dataframe. But fairly simple if you just need the output.
for row in df.itertuples():
print(sorted(row[1:]))
Is there a (more) convenient/efficient method to calculate the number of business days between to dates using pandas?
I could do
len(pd.bdate_range(start='2018-12-03',end='2018-12-14'))-1 # minus one only if end date is a business day
but for longer distances between the start and end day this seems rather inefficient.
There are a couple of suggestion how to use the BDay offset object, but they all seem to refer to the creation of dateranges or something similar.
I am thinking more in terms of a Timedelta object that is represented in business-days.
Say I have two series,s1 and s2, containing datetimes. If pandas had something along the lines of
s1.dt.subtract(s2,freq='B')
# giving a new series containing timedeltas where the number of days calculated
# use business days only
would be nice.
(numpy has a busday_count() method. But I would not want to convert my pandas Timestamps to numpy, as this can get messy.)
I think np.busday_count here is good idea, also convert to numpy arrays is not necessary:
s1 = pd.Series(pd.date_range(start='05/01/2019',end='05/10/2019'))
s2 = pd.Series(pd.date_range(start='05/04/2019',periods=10, freq='5d'))
s = pd.Series([np.busday_count(a, b) for a, b in zip(s1, s2)])
print (s)
0 3
1 5
2 7
3 10
4 14
5 17
6 19
7 23
8 25
9 27
dtype: int64
from xone import calendar
def business_dates(start, end):
us_cal = calendar.USTradingCalendar()
kw = dict(start=start, end=end)
return pd.bdate_range(**kw).drop(us_cal.holidays(**kw))
In [1]: business_dates(start='2018-12-20', end='2018-12-31')
Out[1]: DatetimeIndex(['2018-12-20', '2018-12-21', '2018-12-24', '2018-12-26',
'2018-12-27', '2018-12-28', '2018-12-31'],
dtype='datetime64[ns]', freq=None)
source Get business days between start and end date using pandas
#create dataframes with the dates
df=pd.DataFrame({'dates':pd.date_range(start='05/01/2019',end='05/31/2019')})
#check if the dates are in business days
df[df['dates'].isin(pd.bdate_range(df['dates'].get(0), df['dates'].get(len(df)-1)))]
out[]:
0 2019-05-01
1 2019-05-02
2 2019-05-03
5 2019-05-06
6 2019-05-07
7 2019-05-08
8 2019-05-09
9 2019-05-10
12 2019-05-13
13 2019-05-14
14 2019-05-15
15 2019-05-16
16 2019-05-17
19 2019-05-20
20 2019-05-21
21 2019-05-22
22 2019-05-23
23 2019-05-24
26 2019-05-27
27 2019-05-28
28 2019-05-29
29 2019-05-30
30 2019-05-31
Dear experienced community,
I can't find an elegant solution to my problem.
I have a subsample of my dataset which I want to resample weekly, but starting some weeks before the first entry in my data frame (so a few weeks with 0 counts)
A sample of the data:
In:
print(df_pec.head())
Out:
Count Image_Sequence_DateTime
18 1 2015-11-06 03:22:19
21 1 2015-11-11 01:48:51
22 1 2015-11-11 07:30:47
37 1 2015-11-25 09:42:23
48 1 2015-12-05 12:12:34
With the earliest image sequence at:
In:
df_pec.Image_Sequence_DateTime.min()
Out:
2015-09-30 15:16:38
I have another function that gives me the starting point of the first week and the last point of the last week ever measured in that experiment, which are:
In:
print(s_startend)
Out:
Start 2015-09-28
End 2017-12-25
dtype: datetime64[ns]
My problem is that I want to resample df_pec weekly, but starting on the very first second of the very first day of the very first week of the experimental deployment.(using s_startend as reference)
I try:
df_pec=df_pec.resample('1W', on='Image_Sequence_DateTime').sum()
print(df_pec.head(),'\n',df_pec.tail())
Out:
Count
Image_Sequence_DateTime
2015-10-04 26.0
2015-10-11 92.0
2015-10-18 204.0
2015-10-25 193.0
2015-11-01 187.0
Count
Image_Sequence_DateTime
2017-11-19 20.0
2017-11-26 34.0
2017-12-03 16.0
2017-12-10 11.0
2017-12-17 3.0
This is pretty weird because it is even skipping the first days of data in df_pec.(starting 2015-09-30 15:16:38)
And even if it worked, I have no way of indicating the resampling to start and end in specified values (s_startend from my example), even if there are no records in the earliest and latest weeks in my subsample df_pec.
I thought about artificially adding two entries to df_pec with the real start and real end, but I think it is not so elegant and I don't want to be adding meaningless keys to my df.
Thank you very much for your wisdom!
I have a dataframe of transactions. One of my columns is the date (datetime64[ns]). I'm making a group by of users (email as id). Something I'm interested in is the variability of time between orders of each user. So what I'm looking for in the group by is to find the standard deviation of the difference between dates (in days) for each user. If the user has two or least transactions the answer should be 0. This is some of the dataframe (I changed some things manually):
df
email date
0 cuadros.paolo#gmail.com 2018-05-01 12:29:59
1 rlez_1202#hotmail.com 2018-07-11 13:43:22
2 cuadros.paolo#gmail.com 2018-09-21 12:29:23
3 paola.alvarado#rumah.com.pe 2018-09-01 09:21:43
4 luchosuito#gmail.com 2018-04-30 12:29:30
5 paola.alvarado#rumah.com.pe 2018-03-22 12:29:23
6 davida.alvarado.703#gmail.com 2018-07-21 12:29:17
7 cuadros.paolo#gmail.com 2018-08-11 12:29:41
8 rlez_1202#hotmail.com 2018-05-23 12:29:14
9 luchosuito#gmail.com 2018-06-01 12:29:17
10 jessica26011#hotmail.com 2018-07-18 12:29:20
11 cuadros.paolo#gmail.com 2018-08-21 12:29:40
12 rlez_1202#hotmail.com 2018-10-01 12:29:31
13 paola.alvarado#rumah.com.pe 2018-06-01 12:29:20
14 miluska-paico#hotmail.com 2018-05-21 12:29:18
15 cinthia_leon87#hotmail.com 2018-07-20 12:29:59
I've tried many ways, but still can't get it. Please help.
For sequential differences, which seems to make the most sense given your explanation:
df.sort_values('date').groupby('email').apply(lambda x: x.date.diff().std()).fillna(0)
Output:
email
cinthia_leon87#hotmail.com 0 days 00:00:00
cuadros.paolo#gmail.com 48 days 05:04:12.988006
davida.alvarado.703#gmail.com 0 days 00:00:00
jessica26011#hotmail.com 0 days 00:00:00
luchosuito#gmail.com 0 days 00:00:00
miluska-paico#hotmail.com 0 days 00:00:00
paola.alvarado#rumah.com.pe 14 days 18:10:16.764069
rlez_1202#hotmail.com 23 days 06:17:04.453408
dtype: timedelta64[ns]
.std() will be null for groups with 1 value non-null value and since .diff reduces the number of non-null observations by 1, this automatically returns NaN for any groups with 2 or fewer measurements, which we fill with 0.
Also just be aware that the default for pandas is to use N-1 degrees of freedom.
Is it somehow possible to use resample on irregularly spaced data? (I know that the documentation says it's for "resampling of regular time-series data", but I wanted to try if it works on irregular data, too. Maybe it doesn't, or maybe I am doing something wrong.)
In my real data, I have generally 2 samples per hour, the time difference between them ranging usually from 20 to 40 minutes. So I was hoping to resample them to a regular hourly series.
To test if I am using it right, I used some random list of dates that I already had, so it may not be a best example but at least a solution that works for it will be very robust. here it is:
fraction number time
0 0.729797 0 2014-10-23 15:44:00
1 0.141084 1 2014-10-30 19:10:00
2 0.226900 2 2014-11-05 21:30:00
3 0.960937 3 2014-11-07 05:50:00
4 0.452835 4 2014-11-12 12:20:00
5 0.578495 5 2014-11-13 13:57:00
6 0.352142 6 2014-11-15 05:00:00
7 0.104814 7 2014-11-18 07:50:00
8 0.345633 8 2014-11-19 13:37:00
9 0.498004 9 2014-11-19 22:47:00
10 0.131665 10 2014-11-24 15:28:00
11 0.654018 11 2014-11-26 10:00:00
12 0.886092 12 2014-12-04 06:37:00
13 0.839767 13 2014-12-09 00:50:00
14 0.257997 14 2014-12-09 02:00:00
15 0.526350 15 2014-12-09 02:33:00
Now I want to resample these for example monthly:
df_new = df.set_index(pd.DatetimeIndex(df['time']))
df_new['fraction'] = df.fraction.resample('M',how='mean')
df_new['number'] = df.number.resample('M',how='mean')
But I get TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex' - unless I did something wrong with assigning the datetime index, it must be due to the irregularity?
So my questions are:
Am I using it correctly?
If 1==True, is there no straightforward way to resample the data?
(I only see a solution in first reindexing the data to get finer intervals, interpolate the values in between and then reindexing it to hourly interval. If it is so, then a question regarding the correct implementation of reindex will follow shortly.)
You don't need to explicitly use DatetimeIndex, just set 'time' as the index and pandas will take care of the rest, so long as your 'time' column has been converted to datetime using pd.to_datetime or some other method. Additionally, you don't need to resample each column individually if you're using the same method; just do it on the entire DataFrame.
# Convert to datetime, if necessary.
df['time'] = pd.to_datetime(df['time'])
# Set the index and resample (using month start freq for compact output).
df = df.set_index('time')
df = df.resample('MS').mean()
The resulting output:
fraction number
time
2014-10-01 0.435441 0.5
2014-11-01 0.430544 6.5
2014-12-01 0.627552 13.5