Is it somehow possible to use resample on irregularly spaced data? (I know that the documentation says it's for "resampling of regular time-series data", but I wanted to try if it works on irregular data, too. Maybe it doesn't, or maybe I am doing something wrong.)
In my real data, I have generally 2 samples per hour, the time difference between them ranging usually from 20 to 40 minutes. So I was hoping to resample them to a regular hourly series.
To test if I am using it right, I used some random list of dates that I already had, so it may not be a best example but at least a solution that works for it will be very robust. here it is:
fraction number time
0 0.729797 0 2014-10-23 15:44:00
1 0.141084 1 2014-10-30 19:10:00
2 0.226900 2 2014-11-05 21:30:00
3 0.960937 3 2014-11-07 05:50:00
4 0.452835 4 2014-11-12 12:20:00
5 0.578495 5 2014-11-13 13:57:00
6 0.352142 6 2014-11-15 05:00:00
7 0.104814 7 2014-11-18 07:50:00
8 0.345633 8 2014-11-19 13:37:00
9 0.498004 9 2014-11-19 22:47:00
10 0.131665 10 2014-11-24 15:28:00
11 0.654018 11 2014-11-26 10:00:00
12 0.886092 12 2014-12-04 06:37:00
13 0.839767 13 2014-12-09 00:50:00
14 0.257997 14 2014-12-09 02:00:00
15 0.526350 15 2014-12-09 02:33:00
Now I want to resample these for example monthly:
df_new = df.set_index(pd.DatetimeIndex(df['time']))
df_new['fraction'] = df.fraction.resample('M',how='mean')
df_new['number'] = df.number.resample('M',how='mean')
But I get TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex' - unless I did something wrong with assigning the datetime index, it must be due to the irregularity?
So my questions are:
Am I using it correctly?
If 1==True, is there no straightforward way to resample the data?
(I only see a solution in first reindexing the data to get finer intervals, interpolate the values in between and then reindexing it to hourly interval. If it is so, then a question regarding the correct implementation of reindex will follow shortly.)
You don't need to explicitly use DatetimeIndex, just set 'time' as the index and pandas will take care of the rest, so long as your 'time' column has been converted to datetime using pd.to_datetime or some other method. Additionally, you don't need to resample each column individually if you're using the same method; just do it on the entire DataFrame.
# Convert to datetime, if necessary.
df['time'] = pd.to_datetime(df['time'])
# Set the index and resample (using month start freq for compact output).
df = df.set_index('time')
df = df.resample('MS').mean()
The resulting output:
fraction number
time
2014-10-01 0.435441 0.5
2014-11-01 0.430544 6.5
2014-12-01 0.627552 13.5
Related
I am trying to resample my data to get sums. This resampling needs to be based solely on time. I want to group the times in 6 hours, so regardless of the date I will get 4 sums.
My df looks like this:
booking_count
date_time
2013-04-04 08:32:25 58
2013-04-04 18:43:11 1
2013-30-04 12:39:15 52
2013-14-05 06:51:33 99
2013-01-06 23:59:17 1
2013-03-06 19:37:25 42
2013-27-06 04:12:01 38
With this example data, I expect the get the following results:
00:00:00 38
06:00:00 157
12:00:00 52
18:00:00 43
To get around the date issue, I tried to keep only the time values:
df['time'] = pd.DatetimeIndex(df['date_time']).time
new_df = df[['time', 'booking_bool']].set_index('time').resample('360min').sum()
Unfortunately, this was to no avail. How do I go about getting my required results? Is resample() even suitable for this task?
I don't think resample() is a good method to do this because you need to groupby based on hours independently of the day. Maybe you can try using cut using a custom bins parameter, and then a usual groupby
bins = np.arange(start=0, stop=24+6, step=6)
group = df.groupby(pd.cut(
df.index.hour,
bins, right=False,
labels=pd.date_range('00:00:00', '18:00:00', freq='6H').time)
).sum()
group
# booking_count
# 00:00:00 38
# 06:00:00 157
# 12:00:00 52
# 18:00:00 44
Is there a (more) convenient/efficient method to calculate the number of business days between to dates using pandas?
I could do
len(pd.bdate_range(start='2018-12-03',end='2018-12-14'))-1 # minus one only if end date is a business day
but for longer distances between the start and end day this seems rather inefficient.
There are a couple of suggestion how to use the BDay offset object, but they all seem to refer to the creation of dateranges or something similar.
I am thinking more in terms of a Timedelta object that is represented in business-days.
Say I have two series,s1 and s2, containing datetimes. If pandas had something along the lines of
s1.dt.subtract(s2,freq='B')
# giving a new series containing timedeltas where the number of days calculated
# use business days only
would be nice.
(numpy has a busday_count() method. But I would not want to convert my pandas Timestamps to numpy, as this can get messy.)
I think np.busday_count here is good idea, also convert to numpy arrays is not necessary:
s1 = pd.Series(pd.date_range(start='05/01/2019',end='05/10/2019'))
s2 = pd.Series(pd.date_range(start='05/04/2019',periods=10, freq='5d'))
s = pd.Series([np.busday_count(a, b) for a, b in zip(s1, s2)])
print (s)
0 3
1 5
2 7
3 10
4 14
5 17
6 19
7 23
8 25
9 27
dtype: int64
from xone import calendar
def business_dates(start, end):
us_cal = calendar.USTradingCalendar()
kw = dict(start=start, end=end)
return pd.bdate_range(**kw).drop(us_cal.holidays(**kw))
In [1]: business_dates(start='2018-12-20', end='2018-12-31')
Out[1]: DatetimeIndex(['2018-12-20', '2018-12-21', '2018-12-24', '2018-12-26',
'2018-12-27', '2018-12-28', '2018-12-31'],
dtype='datetime64[ns]', freq=None)
source Get business days between start and end date using pandas
#create dataframes with the dates
df=pd.DataFrame({'dates':pd.date_range(start='05/01/2019',end='05/31/2019')})
#check if the dates are in business days
df[df['dates'].isin(pd.bdate_range(df['dates'].get(0), df['dates'].get(len(df)-1)))]
out[]:
0 2019-05-01
1 2019-05-02
2 2019-05-03
5 2019-05-06
6 2019-05-07
7 2019-05-08
8 2019-05-09
9 2019-05-10
12 2019-05-13
13 2019-05-14
14 2019-05-15
15 2019-05-16
16 2019-05-17
19 2019-05-20
20 2019-05-21
21 2019-05-22
22 2019-05-23
23 2019-05-24
26 2019-05-27
27 2019-05-28
28 2019-05-29
29 2019-05-30
30 2019-05-31
I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day.)
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame – even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row as I run some lambda on the associated historical resampling from each row's date?)
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-02-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import *
df['MonthPrior'] = df.index.to_pydatetime() + DateOffset(months=-1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need to find an efficient way to iterate that so that for each row in df I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DateTimeIndex....
How can I loop through pandas's loc[] function such that given a long Series, I can break it into multiple little ones. Something I imagine would be like
for i in range(1,10):
df.loc['2002-i-01:'2002-(i+1)-01']
Where i represents the number of months.
Consider the dataframe df
df = pd.DataFrame(dict(A=range(100)), pd.date_range('2010-03-31', periods=100))
Observe that you are asking to slice from the beginning of one month to the beginning of the next. Typical python slicing does not include the end point (though loc does). I'll assume you meant to exclude it as that makes this answer convenient.
Use resample with a frequency 'M'
df.resample('M').sum()
A
2010-03-31 0
2010-04-30 465
2010-05-31 1426
2010-06-30 2295
2010-07-31 764
You can iterate through each month
for m, grp in df.groupby(pd.TimeGrouper('M')):
# do stuff
print(m)
2010-03-31 00:00:00
2010-04-30 00:00:00
2010-05-31 00:00:00
2010-06-30 00:00:00
2010-07-31 00:00:00
I have a sequence of datetime objects and a series of data which spans through several years. A can create a Series object and resample it to group it by months:
df=pd.Series(varv,index=dates)
multiMmean=df.resample("M", how='mean')
print multiMmean
This, however, outputs
2005-10-31 172.4
2005-11-30 69.3
2005-12-31 187.6
2006-01-31 126.4
2006-02-28 187.0
2006-03-31 108.3
...
2014-01-31 94.6
2014-02-28 82.3
2014-03-31 130.1
2014-04-30 59.2
2014-05-31 55.6
2014-06-30 1.2
which is a list of the mean value for each month of the series. This is not what I want. I want 12 values, one for every month of the year with a mean for each month through the years. How do I get that for multiMmean?
I have tried using resample("M",how='mean') on multiMmean and list comprehensions but I cannot get it to work. What am I missing?
Thank you.
the following worked for me:
# create some random data with datetime index spanning 17 months
s = pd.Series(index=pd.date_range(start=dt.datetime(2014,1,1), end = dt.datetime(2015,6,1)), data = np.random.randn(517))
In [25]:
# now calc the mean for each month
s.groupby(s.index.month).mean()
Out[25]:
1 0.021974
2 -0.192685
3 0.095229
4 -0.353050
5 0.239336
6 -0.079959
7 0.022612
8 -0.254383
9 0.212334
10 0.063525
11 -0.043072
12 -0.172243
dtype: float64
So we can groupby the month attribute of the datetimeindex and call mean this will calculate the mean for all months