Pandas: Count days in each month between given start and end date

I have a pandas dataframe with some beginning and ending dates.
ActualStartDate ActualEndDate
0 2019-06-30 2019-08-15
1 2019-09-01 2020-01-01
2 2019-08-28 2019-11-13
Given these start & end dates, I need to count how many days of each month fall between them. I can't figure out a good way to approach this, but the resulting dataframe should be something like:
ActualStartDate ActualEndDate 2019-06 2019-07 2019-08 2019-09 2019-10 2019-11 2019-12 2020-01 etc
0 2019-06-30 2019-08-15 1 31 15 0 0 0 0 0
1 2019-09-01 2020-01-01 0 0 0 30 31 30 31 1
2 2019-08-28 2019-11-13 0 0 4 30 31 13 0 0
Note that the actual dataframe has ~1,500 rows with varying beginning & end dates. I'm open to a different df output, but the above shows the idea of what I need to accomplish. Thank you in advance for any help!

The idea is to create month periods with DatetimeIndex.to_period from date_range and count them with Index.value_counts, then build the DataFrame with concat, replace missing values with DataFrame.fillna, and finally join back to the original with DataFrame.join:
L = {r.Index: pd.date_range(r.ActualStartDate, r.ActualEndDate).to_period('M').value_counts()
     for r in df.itertuples()}
df = df.join(pd.concat(L, axis=1).fillna(0).astype(int).T)
print (df)
ActualStartDate ActualEndDate 2019-06 2019-07 2019-08 2019-09 2019-10 \
0 2019-06-30 2019-08-15 1 31 15 0 0
1 2019-09-01 2020-01-01 0 0 0 30 31
2 2019-08-28 2019-11-13 0 0 4 30 31
2019-11 2019-12 2020-01
0 0 0 0
1 30 31 1
2 13 0 0
Performance:
df = pd.concat([df] * 1000, ignore_index=True)
In [44]: %%timeit
    ...: L = {r.Index: pd.date_range(r.ActualStartDate, r.ActualEndDate).to_period('M').value_counts()
    ...:      for r in df.itertuples()}
    ...: df.join(pd.concat(L, axis=1).fillna(0).astype(int).T)
    ...:
689 ms ± 5.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [45]: %%timeit
    ...: df.join(
    ...:     df.apply(lambda v: pd.Series(pd.date_range(v['ActualStartDate'], v['ActualEndDate'], freq='D').to_period('M')), axis=1)
    ...:     .apply(pd.value_counts, axis=1)
    ...:     .fillna(0)
    ...:     .astype(int))
    ...:
994 ms ± 5.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Probably not the most efficient, but it shouldn't be too bad for ~1,500 rows: expand out a date range, convert it to a monthly period, take the counts of those periods and rejoin back to your original DF, e.g.:
res = df.join(
    df.apply(lambda v: pd.Series(pd.date_range(v['ActualStartDate'], v['ActualEndDate'], freq='D').to_period('M')), axis=1)
    .apply(pd.value_counts, axis=1)
    .fillna(0)
    .astype(int)
)
Gives you:
ActualStartDate ActualEndDate 2019-06 2019-07 2019-08 2019-09 2019-10 2019-11 2019-12 2020-01 2020-02 2020-03 2020-04 2020-05 2020-06 2020-07 2020-08 2020-09 2020-10 2020-11
0 2019-06-30 2020-08-15 1 31 31 30 31 30 31 31 29 31 30 31 30 31 15 0 0 0
1 2019-09-01 2020-01-01 0 0 0 30 31 30 31 1 0 0 0 0 0 0 0 0 0 0
2 2019-08-28 2020-11-13 0 0 4 30 31 30 31 31 29 31 30 31 30 31 31 30 31 13

import pandas as pd
import calendar

date_info = pd.DataFrame({
    'ActualStartDate': [
        pd.Timestamp('2019-06-30'),
        pd.Timestamp('2019-09-01'),
        pd.Timestamp('2019-08-28'),
    ],
    'ActualEndDate': [
        pd.Timestamp('2019-08-15'),
        pd.Timestamp('2020-01-01'),
        pd.Timestamp('2019-11-13'),
    ]
})
# ============================================================
result = {}  # collect results in a dict, in case of too many cols
for index, timepair in date_info.iterrows():
    start = timepair['ActualStartDate']
    end = timepair['ActualEndDate']
    current = start
    result[index] = {}  # days per month for this pair
    while True:
        # last day to count in this month: the month's final day, or the
        # end day itself when the range finishes in the current month
        _, days = calendar.monthrange(current.year, current.month)
        if (current.year, current.month) == (end.year, end.month):
            days = end.day
        # the current day also counts, so + 1
        delta = days - current.day + 1
        # zero-pad the month so the columns match the '2019-06' style
        result[index]['%s-%02d' % (current.year, current.month)] = delta
        current += pd.Timedelta(delta, unit='d')
        # current is now the day after the last day counted in this month
        if current > end:
            break
# you can store the result in the dataframe, if you insist
columns = set()
for value in result.values():
    columns.update(value.keys())
for col in columns:
    date_info[col] = 0
for index, delta in result.items():
    for date, days in delta.items():
        date_info.loc[index, date] = days
print(date_info)

Related

Choosing values with df.quantile() for separate years and months

I have a large data set and I want to add values to a column based on the highest values in another column in my data set.
Easy, I can just use df.quantile() and access the appropriate values
However, I want to check for each month in each year.
I solved it for looking at years only, see code below.
I'm sure I could do it for months with nested for loops but I'd rather avoid it if I can.
I guess the most pythonic way would be to not loop at all, but to use pandas in a smarter way...
Any suggestion?
Sample code:
df = pd.read_excel(file)
df.index = df['date']
df = df.drop('date', axis=1)
df['new'] = 0
year = (2016, 2017, 2018, 2019, 2020)
for i in year:
    df['new'].loc[str(i)] = np.where(df['cost'].loc[str(i)] < df['cost'].loc[str(i)].quantile(0.5), 0, 1)
print(df)
Sample input
file =
cost
date
2016-11-01 30
2016-12-01 29
2017-11-01 40
2017-12-01 45
2018-11-30 240
2018-12-01 200
2019-11-30 220
2019-12-30 180
2020-11-30 150
2020-12-30 130
Output
cost new
date
2016-11-01 30 1
2016-12-01 29 0
2017-11-01 40 0
2017-12-01 45 1
2018-11-30 240 1
2018-12-01 200 0
2019-11-30 220 1
2019-12-30 180 0
2020-11-30 150 1
2020-12-30 130 0
Desired output (assuming quantile can work like that on single values; just as an example)
cost new
date
2016-11-01 30 1
2016-12-01 29 1
2017-11-01 40 1
2017-12-01 45 1
2018-11-30 240 1
2018-12-01 200 1
2019-11-30 220 1
2019-12-30 180 1
2020-11-30 150 1
2020-12-30 130 1
Thank you _/_
An interesting question, it took me a little while to work out a solution!
import pandas as pd

df = pd.DataFrame(data={"cost": [30, 29, 40, 45, 240, 200, 220, 180, 150, 130],
                        "date": ["2016-11-01", "2016-12-01", "2017-11-01",
                                 "2017-12-01", "2018-11-30", "2018-12-01",
                                 "2019-11-30", "2019-12-30", "2020-11-30",
                                 "2020-12-30"]})
df["date"] = pd.to_datetime(df["date"])
df.set_index("date", inplace=True)
df["new"] = df.groupby([lambda x: x.year, lambda x: x.month]).transform(lambda x: (x >= x.quantile(0.5))*1)
#Out:
# cost new
#date
#2016-11-01 30 1
#2016-12-01 29 1
#2017-11-01 40 1
#2017-12-01 45 1
#2018-11-30 240 1
#2018-12-01 200 1
#2019-11-30 220 1
#2019-12-30 180 1
#2020-11-30 150 1
#2020-12-30 130 1
What the important line does (see the spelled-out sketch after this list):
Groups by the index year and month
For each item in the group, calculates whether it is greater than or equal to the 0.5 quantile (as bool)
Multiplying by 1 creates an integer bool (1/0) instead of True/False
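For readability, the same line can be spelled out with a named helper; a minimal sketch equivalent to the one-liner above (at_or_above_median is just an illustrative name):
def at_or_above_median(x):
    # 1 when the value is at or above its group's median (0.5 quantile), else 0
    return (x >= x.quantile(0.5)).astype(int)

df["new"] = df.groupby([df.index.year, df.index.month])["cost"].transform(at_or_above_median)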
The initial creation of the dataframe should be equivalent to your df = pd.read_excel(file)
Leaving out the , lambda x: x.month part of the groupby (by year only), the output is the same as your current output:
# cost new
#date
#2016-11-01 30 1
#2016-12-01 29 0
#2017-11-01 40 0
#2017-12-01 45 1
#2018-11-30 240 1
#2018-12-01 200 0
#2019-11-30 220 1
#2019-12-30 180 0
#2020-11-30 150 1
#2020-12-30 130 0

Convert day of the year to datetime

I have data files containing year, day of the year (DOY), hour and minutes, as follows:
BuoyID Year Hour Min DOY POS_DOY Lat Lon Ts
0 300234065718160 2019 7 0 216.2920 216.2920 58.559 -23.914 14.61
1 300234065718160 2019 9 0 216.3750 216.3750 58.563 -23.905 14.60
2 300234065718160 2019 10 0 216.4170 216.4170 58.564 -23.903 14.60
3 300234065718160 2019 11 0 216.4580 216.4580 58.563 -23.906 14.60
4 300234065718160 2019 12 0 216.5000 216.5000 58.561 -23.910 14.60
In order to make my datetime, I used:
dt_raw = pd.to_datetime(df_buoy['Year'] * 1000 + df_buoy['DOY'], format='%Y%j')
# Convert to datetime
dt_buoy = [d.date() for d in dt_raw]
date = datetime.datetime.combine(dt_buoy[0], datetime.time(df_buoy.Hour[0], df_buoy.Min[0]))
My problem arises when the hours are not int, but float instead. For example:
BuoyID Year Hour Min DOY POS_DOY Lat Lon BP Ts
0 300234061876910 2014 23.33 0 226.972 226.972 71.93081 -141.0792 1016.9 -0.01
1 300234061876910 2014 23.50 0 226.979 226.979 71.93020 -141.0826 1016.8 3.36
2 300234061876910 2014 23.67 0 226.986 226.986 71.92968 -141.0856 1016.8 3.28
3 300234061876910 2014 23.83 0 226.993 226.993 71.92934 -141.0876 1016.8 3.22
4 300234061876910 2014 0.00 0 227.000 227.000 71.92904 -141.0894 1016.8 3.18
What I tried was to convert the hours to str and take the first two characters to obtain the whole hour, then subtract this from 'Hour' and multiply by 60 to get the minutes.
int_hour = [(int(str(i)[0:2])) for i in df_buoy.Hour]
minutes = map(lambda x, y: (x - y)*60, df_buoy.Hour, int_hour)
But, of course, if you have '0.' as your hour, Python will complain:
ValueError: invalid literal for int() with base 10: '0.'
My question is: does anyone know a simple way to convert year, DOY, hour (either int or float) and minutes to datetime?
Use to_timedelta to convert the hours column and add it to the datetimes; this works well with both integers and floats:
df['d'] = (pd.to_datetime(df['Year'] * 1000 + df['DOY'], format='%Y%j') +
           pd.to_timedelta(df['Hour'], unit='h'))
print (df)
BuoyID Year Hour Min DOY POS_DOY Lat Lon Ts \
0 300234065718160 2019 7 0 216.292 216.292 58.559 -23.914 14.61
1 300234065718160 2019 9 0 216.375 216.375 58.563 -23.905 14.60
2 300234065718160 2019 10 0 216.417 216.417 58.564 -23.903 14.60
3 300234065718160 2019 11 0 216.458 216.458 58.563 -23.906 14.60
4 300234065718160 2019 12 0 216.500 216.500 58.561 -23.910 14.60
d
0 2019-08-04 07:00:00
1 2019-08-04 09:00:00
2 2019-08-04 10:00:00
3 2019-08-04 11:00:00
4 2019-08-04 12:00:00
df['d'] = (pd.to_datetime(df['Year'] * 1000 + df['DOY'], format='%Y%j') +
           pd.to_timedelta(df['Hour'], unit='h'))
print (df)
BuoyID Year Hour Min DOY POS_DOY Lat Lon \
0 300234061876910 2014 23.33 0 226.972 226.972 71.93081 -141.0792
1 300234061876910 2014 23.50 0 226.979 226.979 71.93020 -141.0826
2 300234061876910 2014 23.67 0 226.986 226.986 71.92968 -141.0856
3 300234061876910 2014 23.83 0 226.993 226.993 71.92934 -141.0876
4 300234061876910 2014 0.00 0 227.000 227.000 71.92904 -141.0894
BP Ts d
0 1016.9 -0.01 2014-08-14 23:19:48
1 1016.8 3.36 2014-08-14 23:30:00
2 1016.8 3.28 2014-08-14 23:40:12
3 1016.8 3.22 2014-08-14 23:49:48
4 1016.8 3.18 2014-08-15 00:00:00
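If the Min column should contribute as well, just add a second to_timedelta term. A minimal sketch, assuming Min holds whole minutes, and casting the fractional DOY to int (its fractional part just duplicates the Hour column):
df['d'] = (pd.to_datetime(df['Year'] * 1000 + df['DOY'].astype(int), format='%Y%j') +
           pd.to_timedelta(df['Hour'], unit='h') +
           pd.to_timedelta(df['Min'], unit='m'))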

Cumulative Sum by date (Month)

I have a pandas dataframe and I need to work out the cumulative sum for each month.
Date Amount
2017/01/12 50
2017/01/12 30
2017/01/15 70
2017/01/23 80
2017/02/01 90
2017/02/01 10
2017/02/02 10
2017/02/03 10
2017/02/03 20
2017/02/04 60
2017/02/04 90
2017/02/04 100
The cumulative sum is the trailing sum for each day, i.e. 01-31. However, some days are missing. The data frame should look like
Date Sum_Amount
2017/01/12 80
2017/01/15 150
2017/01/23 230
2017/02/01 100
2017/02/02 110
2017/02/03 140
2017/02/04 390
If you only need the cumulative sum by month, use groupby with sum and then group by the month of the index:
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 140
6 2017-02-04 390
But if you need both months and years, convert the index to a month period with to_period:
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
The difference is easier to see in a changed df with a different year added:
print (df)
Date Amount
0 2017/01/12 50
1 2017/01/12 30
2 2017/01/15 70
3 2017/01/23 80
4 2017/02/01 90
5 2017/02/01 10
6 2017/02/02 10
7 2017/02/03 10
8 2018/02/03 20
9 2018/02/04 60
10 2018/02/04 90
11 2018/02/04 100
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 140
7 2018-02-04 390
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 20
7 2018-02-04 270

Number of Days in Month

I have a data frame with a date time index, and I would like to multiply some columns with the number of days in that month.
TUFNWGTP TELFS t070101 t070102 t070103 t070104
TUDIARYDATE
2003-01-03 8155462.672158 2 0 0 0 0
2003-01-04 1735322.527819 1 0 0 0 0
2003-01-04 3830527.482672 2 60 0 0 0
2003-01-02 6622022.995205 4 0 0 0 0
2003-01-09 3068387.344956 1 0 0 0 0
Here, I would like to multiply all the columns starting with t with 31. That is, expected output is
TUFNWGTP TELFS t070101 t070102 t070103 t070104
TUDIARYDATE
2003-01-03 8155462.672158 2 0 0 0 0
2003-01-04 1735322.527819 1 0 0 0 0
2003-01-04 3830527.482672 2 1680 0 0 0
2003-01-02 6622022.995205 4 0 0 0 0
2003-01-09 3068387.344956 1 0 0 0 0
I know that there are some ways using calendar or similar, but given that I'm already using pandas, there must be an easier way - I assume.
There is no such datetime property, but there is an offset M - but I don't know how I would use that without massive inefficiency.
There is now a Series.dt.days_in_month attribute for datetime series. Here is an example based on Jeff's answer.
In [3]: df = pd.DataFrame({'date': pd.date_range('20120101', periods=15, freq='M')})
In [4]: df['year'] = df['date'].dt.year
In [5]: df['month'] = df['date'].dt.month
In [6]: df['days_in_month'] = df['date'].dt.days_in_month
In [7]: df
Out[7]:
date year month days_in_month
0 2012-01-31 2012 1 31
1 2012-02-29 2012 2 29
2 2012-03-31 2012 3 31
3 2012-04-30 2012 4 30
4 2012-05-31 2012 5 31
5 2012-06-30 2012 6 30
6 2012-07-31 2012 7 31
7 2012-08-31 2012 8 31
8 2012-09-30 2012 9 30
9 2012-10-31 2012 10 31
10 2012-11-30 2012 11 30
11 2012-12-31 2012 12 31
12 2013-01-31 2013 1 31
13 2013-02-28 2013 2 28
14 2013-03-31 2013 3 31
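Tying this back to the question, a minimal sketch (assuming TUDIARYDATE is already a DatetimeIndex) that scales every column starting with 't' by the length of that row's month:
t_cols = [c for c in df.columns if c.startswith('t')]
df[t_cols] = df[t_cols].mul(df.index.days_in_month, axis=0)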
pd.tslib.monthrange is an unadvertised / undocumented function that handles the days_in_month calculation (adjusting for leap years). This could/should probably be added as a property on Timestamp/DatetimeIndex.
In [34]: df = pd.DataFrame({'date': pd.date_range('20120101', periods=15, freq='M')})
In [35]: df['year'] = df['date'].dt.year
In [36]: df['month'] = df['date'].dt.month
In [37]: df['days_in_month'] = df.apply(lambda x: pd.tslib.monthrange(x['year'],x['month'])[1], axis=1)
In [38]: df
Out[38]:
date year month days_in_month
0 2012-01-31 2012 1 31
1 2012-02-29 2012 2 29
2 2012-03-31 2012 3 31
3 2012-04-30 2012 4 30
4 2012-05-31 2012 5 31
5 2012-06-30 2012 6 30
6 2012-07-31 2012 7 31
7 2012-08-31 2012 8 31
8 2012-09-30 2012 9 30
9 2012-10-31 2012 10 31
10 2012-11-30 2012 11 30
11 2012-12-31 2012 12 31
12 2013-01-31 2013 1 31
13 2013-02-28 2013 2 28
14 2013-03-31 2013 3 31
Here is a little clunky hand-made method to get the number of days in a month
import datetime

def days_in_month(dt):
    # first day of the following month (integer division handles December)
    next_month = datetime.datetime(
        dt.year + dt.month // 12, dt.month % 12 + 1, 1)
    start_month = datetime.datetime(dt.year, dt.month, 1)
    td = next_month - start_month
    return td.days
For example:
>>> days_in_month(datetime.datetime.strptime('2013-12-12', '%Y-%m-%d'))
31
>>> days_in_month(datetime.datetime.strptime('2013-02-12', '%Y-%m-%d'))
28
>>> days_in_month(datetime.datetime.strptime('2012-02-12', '%Y-%m-%d'))
29
>>> days_in_month(datetime.datetime.strptime('2012-01-12', '%Y-%m-%d'))
31
>>> days_in_month(datetime.datetime.strptime('2013-11-12', '%Y-%m-%d'))
30
I'll let you figure out how to read your table and do the multiplication yourself :)
import pandas as pd
from pandas.tseries.offsets import MonthEnd
df['dim'] = (pd.to_datetime(df.index) + MonthEnd(0)).day
You can omit pd.to_datetime(), if your index is already DatetimeIndex.

Resampling Within a Pandas MultiIndex

I have some hierarchical data which bottoms out into time series data which looks something like this:
df = pandas.DataFrame(
    {'value_a': values_a, 'value_b': values_b},
    index=[states, cities, dates])
df.index.names = ['State', 'City', 'Date']
df
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 0 10
2012-01-02 1 11
2012-01-03 2 12
2012-01-04 3 13
Savanna 2012-01-01 4 14
2012-01-02 5 15
2012-01-03 6 16
2012-01-04 7 17
Alabama Mobile 2012-01-01 8 18
2012-01-02 9 19
2012-01-03 10 20
2012-01-04 11 21
Montgomery 2012-01-01 12 22
2012-01-02 13 23
2012-01-03 14 24
2012-01-04 15 25
I'd like to perform time resampling per city, so something like
df.resample("2D", how="sum")
would output
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
as is, df.resample('2D', how='sum') gets me
TypeError: Only valid with DatetimeIndex or PeriodIndex
Fair enough, but I'd sort of expect this to work:
>>> df.swaplevel('Date', 'State').resample('2D', how='sum')
TypeError: Only valid with DatetimeIndex or PeriodIndex
at which point I'm really running out of ideas... is there some way stack and unstack might be able to help me?
pd.Grouper allows you to specify a "groupby instruction for a target object". In particular, you can use it to group by dates even if df.index is not a DatetimeIndex:
df.groupby(pd.Grouper(freq='2D', level=-1))
The level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex.
Moreover, you can use this in conjunction with other level values from the index:
level_values = df.index.get_level_values
result = (df.groupby([level_values(i) for i in [0, 1]]
                     + [pd.Grouper(freq='2D', level=-1)]).sum())
It looks a bit awkward, but using_Grouper turns out to be much faster than my original suggestion, using_reset_index:
import numpy as np
import pandas as pd
import datetime as DT

def using_Grouper(df):
    level_values = df.index.get_level_values
    return (df.groupby([level_values(i) for i in [0, 1]]
                       + [pd.Grouper(freq='2D', level=-1)]).sum())

def using_reset_index(df):
    df = df.reset_index(level=[0, 1])
    return df.groupby(['State', 'City']).resample('2D').sum()

def using_stack(df):
    # http://stackoverflow.com/a/15813787/190597
    return (df.unstack(level=[0, 1])
              .resample('2D').sum()
              .stack(level=[2, 1])
              .swaplevel(2, 0))

def make_orig():
    values_a = range(16)
    values_b = range(10, 26)
    states = ['Georgia'] * 8 + ['Alabama'] * 8
    cities = ['Atlanta'] * 4 + ['Savanna'] * 4 + ['Mobile'] * 4 + ['Montgomery'] * 4
    dates = pd.DatetimeIndex([DT.date(2012, 1, 1) + DT.timedelta(days=i) for i in range(4)] * 4)
    df = pd.DataFrame(
        {'value_a': values_a, 'value_b': values_b},
        index=[states, cities, dates])
    df.index.names = ['State', 'City', 'Date']
    return df

def make_df(N):
    dates = pd.date_range('2000-1-1', periods=N)
    states = np.arange(50)
    cities = np.arange(10)
    index = pd.MultiIndex.from_product([states, cities, dates],
                                       names=['State', 'City', 'Date'])
    df = pd.DataFrame(np.random.randint(10, size=(len(index), 2)), index=index,
                      columns=['value_a', 'value_b'])
    return df

df = make_orig()
print(using_Grouper(df))
df = make_orig()
print(using_Grouper(df))
yields
value_a value_b
State City Date
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
Here is a benchmark comparing using_Grouper, using_reset_index, using_stack on a 5000-row DataFrame:
In [30]: df = make_df(10)
In [34]: len(df)
Out[34]: 5000
In [32]: %timeit using_Grouper(df)
100 loops, best of 3: 6.03 ms per loop
In [33]: %timeit using_stack(df)
10 loops, best of 3: 22.3 ms per loop
In [31]: %timeit using_reset_index(df)
1 loop, best of 3: 659 ms per loop
You need the groupby() method and provide it with a pd.Grouper for each level of your MultiIndex you wish to maintain in the resulting DataFrame. You can then apply an operation of choice.
To resample date or timestamp levels, you need to set the freq argument with the frequency of choice — a similar approach using pd.TimeGrouper() is deprecated in favour of pd.Grouper() with the freq argument set.
This should give you the DataFrame you need:
df.groupby([pd.Grouper(level='State'),
            pd.Grouper(level='City'),
            pd.Grouper(level='Date', freq='2D')]
           ).sum()
The Time Series Guide in the pandas documentation describes resample() as:
... a time-based groupby, followed by a reduction method on each of its groups.
Hence, using groupby() should technically be the same operation as using .resample() on a DataFrame with a single index.
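A quick illustration of that equivalence on a single-index series; a sketch with made-up data:
import pandas as pd

s = pd.Series(range(6), index=pd.date_range('2012-01-01', periods=6))
# both produce two-day buckets with identical sums
assert s.resample('2D').sum().equals(s.groupby(pd.Grouper(freq='2D')).sum())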
The same paragraph points to the cookbook section on resampling for more advanced examples, where the 'Grouping using a MultiIndex' entry is highly relevant for this question. Hope that helps.
An alternative using stack/unstack
df.unstack(level=[0,1]).resample('2D', how='sum').stack(level=[2,1]).swaplevel(2,0)
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 1 21
Alabama Mobile 2012-01-01 17 37
Montgomery 2012-01-01 25 45
Georgia Savanna 2012-01-01 9 29
Atlanta 2012-01-03 5 25
Alabama Mobile 2012-01-03 21 41
Montgomery 2012-01-03 29 49
Georgia Savanna 2012-01-03 13 33
Notes:
No idea about performance comparison
Possible pandas bug - stack(level=[2,1]) worked, but stack(level=[1,2]) failed
This works:
df.groupby(level=[0,1]).apply(lambda x: x.set_index('Date').resample('2D', how='sum'))
value_a value_b
State City Date
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
If the Date column is strings, then convert to datetime beforehand:
df['Date'] = pd.to_datetime(df['Date'])
I had the same issue and was breaking my head over it for a while, but then I read the documentation of the .resample function in the 0.19.2 docs and saw there's a new kwarg called "level" that you can use to specify a level in a MultiIndex.
Edit: More details in the "What's New" section.
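A minimal sketch of that kwarg; note that it resamples on the Date level alone and aggregates across the other levels, so the per-State/City grouping is lost:
df.resample('2D', level='Date').sum()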
I know this question is a few years old, but I had the same problem and came to a simpler solution that requires 1 line:
>>> import pandas as pd
>>> ts = pd.read_pickle('time_series.pickle')
>>> ts
xxxxxx1 yyyyyyyyyyyyyyyyyyyyyy1 2012-07-01 1
2012-07-02 13
2012-07-03 1
2012-07-04 1
2012-07-05 10
2012-07-06 4
2012-07-07 47
2012-07-08 0
2012-07-09 3
2012-07-10 22
2012-07-11 3
2012-07-12 0
2012-07-13 22
2012-07-14 1
2012-07-15 2
2012-07-16 2
2012-07-17 8
2012-07-18 0
2012-07-19 1
2012-07-20 10
2012-07-21 0
2012-07-22 3
2012-07-23 0
2012-07-24 35
2012-07-25 6
2012-07-26 1
2012-07-27 0
2012-07-28 6
2012-07-29 23
2012-07-30 0
..
xxxxxxN yyyyyyyyyyyyyyyyyyyyyyN 2014-06-02 0
2014-06-03 1
2014-06-04 0
2014-06-05 0
2014-06-06 0
2014-06-07 0
2014-06-08 2
2014-06-09 0
2014-06-10 0
2014-06-11 0
2014-06-12 0
2014-06-13 0
2014-06-14 0
2014-06-15 0
2014-06-16 0
2014-06-17 0
2014-06-18 0
2014-06-19 0
2014-06-20 0
2014-06-21 0
2014-06-22 0
2014-06-23 0
2014-06-24 0
2014-06-25 4
2014-06-26 0
2014-06-27 1
2014-06-28 0
2014-06-29 0
2014-06-30 1
2014-07-01 0
dtype: int64
>>> ts.unstack().T.resample('W', how='sum').T.stack()
xxxxxx1 yyyyyyyyyyyyyyyyyyyyyy1 2012-06-25/2012-07-01 1
2012-07-02/2012-07-08 76
2012-07-09/2012-07-15 53
2012-07-16/2012-07-22 24
2012-07-23/2012-07-29 71
2012-07-30/2012-08-05 38
2012-08-06/2012-08-12 258
2012-08-13/2012-08-19 144
2012-08-20/2012-08-26 184
2012-08-27/2012-09-02 323
2012-09-03/2012-09-09 198
2012-09-10/2012-09-16 348
2012-09-17/2012-09-23 404
2012-09-24/2012-09-30 380
2012-10-01/2012-10-07 367
2012-10-08/2012-10-14 163
2012-10-15/2012-10-21 338
2012-10-22/2012-10-28 252
2012-10-29/2012-11-04 197
2012-11-05/2012-11-11 336
2012-11-12/2012-11-18 234
2012-11-19/2012-11-25 143
2012-11-26/2012-12-02 204
2012-12-03/2012-12-09 296
2012-12-10/2012-12-16 146
2012-12-17/2012-12-23 85
2012-12-24/2012-12-30 198
2012-12-31/2013-01-06 214
2013-01-07/2013-01-13 229
2013-01-14/2013-01-20 192
...
xxxxxxN yyyyyyyyyyyyyyyyyyyyyyN 2013-12-09/2013-12-15 3
2013-12-16/2013-12-22 0
2013-12-23/2013-12-29 0
2013-12-30/2014-01-05 1
2014-01-06/2014-01-12 3
2014-01-13/2014-01-19 6
2014-01-20/2014-01-26 11
2014-01-27/2014-02-02 0
2014-02-03/2014-02-09 1
2014-02-10/2014-02-16 4
2014-02-17/2014-02-23 3
2014-02-24/2014-03-02 1
2014-03-03/2014-03-09 4
2014-03-10/2014-03-16 0
2014-03-17/2014-03-23 0
2014-03-24/2014-03-30 9
2014-03-31/2014-04-06 1
2014-04-07/2014-04-13 1
2014-04-14/2014-04-20 1
2014-04-21/2014-04-27 2
2014-04-28/2014-05-04 8
2014-05-05/2014-05-11 7
2014-05-12/2014-05-18 5
2014-05-19/2014-05-25 2
2014-05-26/2014-06-01 8
2014-06-02/2014-06-08 3
2014-06-09/2014-06-15 0
2014-06-16/2014-06-22 0
2014-06-23/2014-06-29 5
2014-06-30/2014-07-06 1
dtype: int64
ts.unstack().T.resample('W', how='sum').T.stack() is all it took! Very easy and seems quite performant. The pickle I'm reading in is 331M, so this is a pretty beefy data structure; the resampling takes just a couple seconds on my MacBook Pro.
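On current pandas versions, where the how= keyword has since been removed, the same one-liner would read:
ts.unstack().T.resample('W').sum().T.stack()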
I haven't checked the efficiency of this, but my instinctual way of performing datetime operations on a multi-index was by a kind of manual "split-apply-combine" process using a dictionary comprehension.
Assuming your DataFrame is unindexed (you can do .reset_index() first), this works as follows:
Group by the non-date columns
Set "Date" as index and resample each chunk
Reassemble using pd.concat
The final code looks like:
pd.concat({g: x.set_index("Date").resample("2D").mean()
           for g, x in house.groupby(["State", "City"])})
I have tried this on my own; it's pretty short and pretty simple too (I will only work with 2 indexes, and you'll get the full idea):
Step 1: resample the date, but that would give you the date without the other index:
new = df.reset_index('City').groupby('crime', group_keys=False).resample('2d').sum().pad()
That would give you the date and its count.
Step 2: get the categorical index in the same order as the date:
col = df.reset_index('City').groupby('City', group_keys=False).resample('2D').pad()[['City']]
That would give you a new column with the city names, in the same order as the dates.
Step 3: merge the dataframes together
new_df = pd.concat([new, col], axis=1)
It's pretty simple; you can make it even shorter, though.
