Plot overlapped different year trend [duplicate] - python

This question already has answers here:
plot year over year on 12 month axis
(2 answers)
Closed 1 year ago.
I have data across multiple years, like this:
time value
0 2015-01-01 0.982295
1 2015-02-01 3.283557
2 2015-03-01 2.665395
3 2015-04-01 3.124564
4 2015-05-01 4.747362
5 2015-06-01 4.436057
6 2015-07-01 3.925824
7 2015-08-01 4.772219
8 2015-09-01 5.313609
9 2015-10-01 6.213427
10 2015-11-01 6.870897
11 2015-12-01 8.130550
12 2016-01-01 1.984611
13 2016-02-01 1.782809
14 2016-03-01 2.904271
15 2016-04-01 3.029645
16 2016-05-01 3.810806
17 2016-06-01 4.365906
18 2016-07-01 3.922678
19 2016-08-01 4.354115
20 2016-09-01 5.376155
21 2016-10-01 7.290028
22 2016-11-01 6.504523
23 2016-12-01 7.689338
24 2017-01-01 2.158096
25 2017-02-01 1.983260
26 2017-03-01 3.774609
27 2017-04-01 3.570528
28 2017-05-01 3.283161
29 2017-06-01 3.834184
30 2017-07-01 4.388914
31 2017-08-01 5.035261
32 2017-09-01 4.844120
33 2017-10-01 6.206708
34 2017-11-01 6.198993
35 2017-12-01 7.220857
36 2018-01-01 1.346803
37 2018-02-01 2.361194
38 2018-03-01 3.478777
39 2018-04-01 4.093510
40 2018-05-01 3.730770
41 2018-06-01 3.612807
42 2018-07-01 5.524375
43 2018-08-01 5.604300
44 2018-09-01 6.412848
45 2018-10-01 5.463882
46 2018-11-01 6.224526
47 2018-12-01 7.082455
48 2019-01-01 0.893474
49 2019-02-01 1.393201
50 2019-03-01 3.163579
51 2019-04-01 3.506390
52 2019-05-01 3.564924
53 2019-06-01 4.852669
54 2019-07-01 4.087379
55 2019-08-01 4.800931
56 2019-09-01 4.907763
57 2019-10-01 7.235331
58 2019-11-01 6.841004
59 2019-12-01 7.854044
I want to plot the trend of each year, overlaid on the others.
I tried this code:
df['year'] = df['time'].dt.year
for year in df['year'].unique():
    plt.plot(df[df['year'] == year]['time'],
             df[df['year'] == year]['value'])
plt.show()
But with this the years are plotted one after another along the full time axis, so the lines do not overlap.
I want the lines to overlap, with the x axis running from Jan to Dec.
I already found this question, but I want to use neither a groupby on my dataframe nor a MultiIndex.

You could extract:
- month, to be set as the x axis
- year, to be used as a filter
from the time column, and use them to plot your data, as in this code:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as md

df = pd.read_csv('data.csv')
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d')
df['month'] = pd.to_datetime(df['time'].dt.month, format='%m')
df['year'] = df['time'].dt.year

fig, ax = plt.subplots(figsize=(16, 8))
for year in df['year'].unique():
    ax.plot(df[df['year'] == year]['month'],
            df[df['year'] == year]['value'],
            label=year)
ax.xaxis.set_major_locator(md.MonthLocator())
ax.xaxis.set_major_formatter(md.DateFormatter('%b'))
ax.set_xlim([df['month'].iloc[0], df['month'].iloc[-1]])
plt.legend()
plt.show()
which gives a plot where all the years are overlaid on a single January to December axis.
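A note on the month column used above: pd.to_datetime(df['time'].dt.month, format='%m') parses only the month number, so every value lands in strptime's default year (1900). That shared dummy year is what lets all the years be drawn on one axis:

pd.to_datetime('5', format='%m')   # Timestamp('1900-05-01 00:00:00')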
Or, better, use sns.lineplot in order to avoid the for loop:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as md
import seaborn as sns

df = pd.read_csv('data.csv')
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d')
df['month'] = pd.to_datetime(df['time'].dt.month, format='%m')
df['year'] = df['time'].dt.year

fig, ax = plt.subplots(figsize=(16, 8))
palette = sns.color_palette('Set1', n_colors=len(df['year'].unique()))
sns.lineplot(ax=ax,
             data=df,
             x='month',
             y='value',
             hue='year',
             palette=palette,
             ci=None)
ax.xaxis.set_major_locator(md.MonthLocator())
ax.xaxis.set_major_formatter(md.DateFormatter('%b'))
ax.set_xlim([df['month'].iloc[0], df['month'].iloc[-1]])
plt.legend()
plt.show()
which gives almost the same plot.
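A version note: recent seaborn releases (0.12 and later) deprecate the ci argument in favour of errorbar, so on a current install the call would be (check against your installed version):

sns.lineplot(ax=ax, data=df, x='month', y='value', hue='year', palette=palette, errorbar=None)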

Related

Detrending by date ranges

Consider a df structured like this:
Time X
01-01-18 1
01-02-18 20
01-03-18 34
01-04-18 67
01-01-18 89
01-02-18 45
01-03-18 22
01-04-18 1
01-01-19 11
01-02-19 6
01-03-19 78
01-04-19 5
01-01-20 23
01-02-20 6
01-03-20 9
01-04-20 56
01-01-21 78
01-02-21 33
01-03-21 2
01-04-21 67
I want to de-trend the time series from February to April for each year and append the result to a new column Y.
So far I have thought of something like this:
from datetime import date, timedelta
import numpy as np
import pandas as pd
from scipy import signal

df = pd.read_csv(...)
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
df['Y'] = np.nan

def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

start_date = df.date(2018, 1, 2)
end_date = df.date(2018, 1, 4)
for date in daterange(start_date, end_date):
    df['Y'] = signal.detrend(df['X'])
My concern is that this would iterate over single observations and not over the trend of the selected period. Is there any way to fix it?
Another issue is how to iterate over all the years without changing the start/end dates each time.
When converting strings to the datetime format, you can specify the format directly. infer_datetime_format can mix up day and month.
df['date'] = pd.to_datetime(df['Time'], format='%d-%m-%y')
from scipy.signal import detrend
IIUC, here are two ways to achieve what you want:
1. I would prefer this way, using .apply():
def f(df):
    result = df['X'].copy()
    months = df['date'].dt.month
    mask = (months >= 2) & (months <= 4)
    result[mask] = detrend(result[mask])
    return result

df['new'] = df.groupby(df['date'].dt.year, group_keys=False).apply(f)
2. Another way, using .transform():
ser = df['X'].copy()
ser.index = df['date']

def f(s):
    result = s.copy()
    months = s.index.month
    mask = (months >= 2) & (months <= 4)
    result[mask] = detrend(result[mask])
    return result

new = ser.groupby(ser.index.year).transform(f)
new.index = df.index
df['new'] = new
Result:
date X new
0 2018-01-01 1 1.000000
1 2018-02-01 20 -22.428571
2 2018-03-01 34 -4.057143
3 2018-04-01 67 33.314286
4 2018-01-01 89 89.000000
5 2018-02-01 45 15.685714
6 2018-03-01 22 -2.942857
7 2018-04-01 1 -19.571429
8 2019-01-01 11 11.000000
9 2019-02-01 6 -24.166667
10 2019-03-01 78 48.333333
11 2019-04-01 5 -24.166667
12 2020-01-01 23 23.000000
13 2020-02-01 6 7.333333
14 2020-03-01 9 -14.666667
15 2020-04-01 56 7.333333
16 2021-01-01 78 78.000000
17 2021-02-01 33 16.000000
18 2021-03-01 2 -32.000000
19 2021-04-01 67 16.000000
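As a quick sanity check (a sketch, not part of the original answer, assuming the df with 'date' and 'new' columns from above): scipy's detrend removes a least-squares linear fit, so within each year the detrended February to April values should sum to approximately zero.

import numpy as np
# each year's Feb-Apr residuals sum to ~0 once the fitted line is removed
yearly = df.groupby(df['date'].dt.year).apply(
    lambda g: g.loc[g['date'].dt.month.between(2, 4), 'new'].sum())
assert np.allclose(yearly, 0)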

Correct indexing of time intervals

I'm new to pandas and am trying to do an aggregation. I converted the DataFrame column to a datetime format and built an index that changes for every day.
model['time_only'] = [time.time() for time in model['date']]
model['date_only'] = [date.date() for date in model['date']]
model['cumsum'] = ((model['date_only'].diff() == datetime.timedelta(days=1))*1).cumsum()

def get_out_of_market_data(data):
    df = data.copy()
    start_market_time = datetime.time(hour=13, minute=30)
    end_market_time = datetime.time(hour=20, minute=0)
    df['time_only'] = [time.time() for time in df['date']]
    df['date_only'] = [date.date() for date in df['date']]
    cond = (start_market_time > df['time_only']) | (df['time_only'] >= end_market_time)
    return data[cond]

model['date'] = pd.to_datetime(model['date'])
new = model.drop(columns=['time_only', 'date_only'])
get_out_of_market_data(data=new).head(20)
What I get:
0 0 65.5000 65.50 65.5000 65.500 DD 1 125 65.500000 2016-01-04 13:15:00 0
26 26 62.7438 62.96 62.6600 62.956 DD 1639 174595 62.781548 2016-01-04 20:00:00 0
27 27 62.5900 62.79 62.5300 62.747 DD 2113 268680 62.650260 2016-01-04 20:15:00 0
28 28 62.7950 62.80 62.5400 62.590 DD 2652 340801 62.652640 2016-01-04 20:30:00 0
29 29 63.1000 63.12 62.7800 62.800 DD 6284 725952 62.963512 2016-01-04 20:45:00 0
30 30 63.2200 63.22 63.0700 63.080 DD 21 699881 63.070114 2016-01-04 21:00:00 0
31 31 63.2200 63.22 63.2200 63.220 DD 7 1973 63.220000 2016-01-04 22:00:00 0
32 32 63.4000 63.40 63.4000 63.400 DD 2 150 63.400000 2016-01-05 00:30:00 1
33 33 62.3700 62.37 62.3700 62.370 DD 3 350 62.370000 2016-01-05 11:00:00 1
34 34 62.1000 62.37 62.1000 62.370 DD 2 300 62.280000 2016-01-05 11:15:00 1
35 35 62.0800 62.08 62.0800 62.080 DD 1 100 62.080000 2016-01-05 11:45:00 1
The last two columns are the time interval from 20:00 to 13:30, with the indexes marking the change of each day.
I tried to group the interval from 20:00 one day to 13:30 the next by the last column, indexing each interval through the groupby.
I do not fully understand the method, but for example:
new.groupby(pd.Grouper(freq='17hours'))
How can I move the indexing to this interval?
You could try creating a new column to represent the market day it belongs to. If the time is before 13:30:00, it is the previous market day; otherwise it is today's market day. Then you can group by it. The code will be:
def get_market_day(dt):
    if dt.time() < datetime.time(13, 30, 0):
        return dt.date() - datetime.timedelta(days=1)
    else:
        return dt.date()

df["market_day"] = df["dt"].map(get_market_day)
df.groupby("market_day").agg(...)
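A minimal, self-contained run of that idea (a sketch; the column name dt stands in for your datetime column):

import datetime
import pandas as pd

df = pd.DataFrame({"dt": pd.to_datetime([
    "2016-01-04 13:15:00",  # before 13:30 -> belongs to the 2016-01-03 market day
    "2016-01-04 20:00:00",  # 13:30 or later -> belongs to the 2016-01-04 market day
    "2016-01-05 00:30:00",  # after midnight, before 13:30 -> still 2016-01-04
])})

def get_market_day(dt):
    if dt.time() < datetime.time(13, 30, 0):
        return dt.date() - datetime.timedelta(days=1)
    return dt.date()

df["market_day"] = df["dt"].map(get_market_day)
print(df.groupby("market_day").size())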

Subtract date from datetime columns

I have a dataframe where one of the columns ('ProcessingDATE') is in datetime format. I want to create another column ('Report Date') where, if the processing date is a Monday, 3 days are subtracted from it (which lands on the previous Friday); otherwise 1 day is subtracted.
I've only been using Python for a short time, so I don't have much idea how to write it. My thought was to write a for loop: if the cell is a Monday, then datetime.datetime.today() - datetime.timedelta(days=3); else datetime.datetime.today() - datetime.timedelta(days=1).
for j in range(len(DDA_company['ProcessingDATE'])):
    if pd.to_datetime(datetime(DDA_company.ProcessingDATE[j])).weekday() == 2
Hope this helps,
from datetime import timedelta

def report_date(d):
    # Monday (weekday() == 0): go back 3 days to the previous Friday
    if d.weekday() == 0:
        return d - timedelta(days=3)
    # otherwise subtract a single day
    return d - timedelta(days=1)

DDA_company['Report Date'] = DDA_company['ProcessingDATE'].apply(report_date)
the above can also be written as:
DDA_company['Report Date'] = DDA_company['ProcessingDATE'].apply(
    lambda d: d - timedelta(days=3) if d.weekday() == 0 else d - timedelta(days=1))
Use pandas.Series.dt.weekday and some logic:
import pandas as pd
df = pd.DataFrame({'ProcessingDATE':pd.date_range('2019-04-01', '2019-04-27')})
df1 = df.copy()
mask = df1['ProcessingDATE'].dt.weekday == 0
df.loc[mask, 'ProcessingDATE'] = df1['ProcessingDATE'] - pd.to_timedelta('3 days')
df.loc[~mask, 'ProcessingDATE'] = df1['ProcessingDATE'] - pd.to_timedelta('1 days')
Output:
ProcessingDATE
0 2019-03-29
1 2019-04-01
2 2019-04-02
3 2019-04-03
4 2019-04-04
5 2019-04-05
6 2019-04-06
7 2019-04-05
8 2019-04-08
9 2019-04-09
10 2019-04-10
11 2019-04-11
12 2019-04-12
13 2019-04-13
14 2019-04-12
15 2019-04-15
16 2019-04-16
17 2019-04-17
18 2019-04-18
19 2019-04-19
20 2019-04-20
21 2019-04-19
22 2019-04-22
23 2019-04-23
24 2019-04-24
25 2019-04-25
26 2019-04-26
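The same logic can also be written without the mask, using numpy.where (a sketch, not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame({'ProcessingDATE': pd.date_range('2019-04-01', '2019-04-27')})
# 3 days back on Mondays (weekday 0), otherwise 1 day back
days = np.where(df['ProcessingDATE'].dt.weekday == 0, 3, 1)
df['Report Date'] = df['ProcessingDATE'] - pd.to_timedelta(days, unit='D')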

pandas multiple date ranges from column of dates

Current df:
ID Date
11 3/19/2018
22 1/5/2018
33 2/12/2018
.. ..
I have the df with ID and Date. ID is unique in the original df.
I would like to create a new df based on Date. Each ID has a max Date; I would like to use that date and go back 4 days (5 rows for each ID).
There are thousands of IDs.
Expect to get:
ID Date
11 3/15/2018
11 3/16/2018
11 3/17/2018
11 3/18/2018
11 3/19/2018
22 1/1/2018
22 1/2/2018
22 1/3/2018
22 1/4/2018
22 1/5/2018
33 2/8/2018
33 2/9/2018
33 2/10/2018
33 2/11/2018
33 2/12/2018
… …
I tried the following method; I think pd.date_range might be the right direction, but I keep getting an error:
def date_list(row):
    list = pd.date_range(row["Date"], periods=5)
    return list

df["Date_list"] = df.apply(date_list, axis="columns")
Here is another approach, using df.assign to overwrite Date and pd.concat to glue the ranges together. cᴏʟᴅsᴘᴇᴇᴅ's solution wins on performance, but I think this might be a nice addition as it is quite easy to read and understand.
df = pd.concat([df.assign(Date=df.Date - pd.Timedelta(days=i)) for i in range(5)])
Alternative:
dates = (pd.date_range(*x) for x in zip(df['Date'] - pd.Timedelta(days=4), df['Date']))
df = (pd.DataFrame(dict(zip(df['ID'], dates)))
        .T
        .stack()
        .reset_index(0)
        .rename(columns={'level_0': 'ID', 0: 'Date'}))
Full example:
import pandas as pd
from io import StringIO

data = '''\
ID Date
11 3/19/2018
22 1/5/2018
33 2/12/2018'''

# Recreate dataframe
df = pd.read_csv(StringIO(data), sep=r'\s+')
df['Date'] = pd.to_datetime(df.Date)

df = pd.concat([df.assign(Date=df.Date - pd.Timedelta(days=i)) for i in range(5)])
df.sort_values(by=['ID', 'Date'], ascending=[True, True], inplace=True)
print(df)
Returns:
ID Date
0 11 2018-03-15
0 11 2018-03-16
0 11 2018-03-17
0 11 2018-03-18
0 11 2018-03-19
1 22 2018-01-01
1 22 2018-01-02
1 22 2018-01-03
1 22 2018-01-04
1 22 2018-01-05
2 33 2018-02-08
2 33 2018-02-09
2 33 2018-02-10
2 33 2018-02-11
2 33 2018-02-12
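If you prefer a clean sequential index instead of the repeated original row labels, you can reset it afterwards (a sketch):

df = df.reset_index(drop=True)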
reindexing with pd.date_range
Let's try creating a flat list of date-ranges and reindexing this DataFrame.
from itertools import chain

v = df.assign(Date=pd.to_datetime(df.Date)).set_index('Date')
# assuming ID is a string column
v.reindex(chain.from_iterable(
    pd.date_range(end=i, periods=5) for i in v.index)
).bfill().reset_index()
Date ID
0 2018-03-15 11
1 2018-03-16 11
2 2018-03-17 11
3 2018-03-18 11
4 2018-03-19 11
5 2018-01-01 22
6 2018-01-02 22
7 2018-01-03 22
8 2018-01-04 22
9 2018-01-05 22
10 2018-02-08 33
11 2018-02-09 33
12 2018-02-10 33
13 2018-02-11 33
14 2018-02-12 33
concat based solution on keys
Just for fun. My reindex solution is definitely more performant and easier to read, so if you were to pick one, use that.
v = df.assign(Date=pd.to_datetime(df.Date))
v_dict = {
    j: pd.DataFrame(
        pd.date_range(end=i, periods=5), columns=['Date']
    )
    for j, i in zip(v.ID, v.Date)
}

(pd.concat(v_dict, axis=0)
   .reset_index(level=1, drop=True)
   .rename_axis('ID')
   .reset_index()
)
ID Date
0 11 2018-03-15
1 11 2018-03-16
2 11 2018-03-17
3 11 2018-03-18
4 11 2018-03-19
5 22 2018-01-01
6 22 2018-01-02
7 22 2018-01-03
8 22 2018-01-04
9 22 2018-01-05
10 33 2018-02-08
11 33 2018-02-09
12 33 2018-02-10
13 33 2018-02-11
14 33 2018-02-12
Group by ID, select the column Date, and for each group generate a series of five days leading up to the greatest date.
Rather than writing a long lambda, I've written a helper function:
def drange(x):
    e = x.max()
    s = e - pd.Timedelta(days=4)
    return pd.Series(pd.date_range(s, e))

res = df.groupby('ID').Date.apply(drange)
Then drop the extraneous level from the resulting MultiIndex and we get the desired output:
res.reset_index(level=0).reset_index(drop=True)
# outputs:
ID Date
0 11 2018-03-15
1 11 2018-03-16
2 11 2018-03-17
3 11 2018-03-18
4 11 2018-03-19
5 22 2018-01-01
6 22 2018-01-02
7 22 2018-01-03
8 22 2018-01-04
9 22 2018-01-05
10 33 2018-02-08
11 33 2018-02-09
12 33 2018-02-10
13 33 2018-02-11
14 33 2018-02-12
Compact alternative
# Helper function returning a Series with the date range
func = lambda x: pd.date_range(x.iloc[0] - pd.Timedelta(days=4), x.iloc[0]).to_series()
res = df.groupby('ID').Date.apply(func).reset_index().drop('level_1', axis=1)
You can try groupby with date_range
df.groupby('ID').Date.apply(lambda x : pd.Series(pd.date_range(end=x.iloc[0],periods=5))).reset_index(level=0)
Out[793]:
ID Date
0 11 2018-03-15
1 11 2018-03-16
2 11 2018-03-17
3 11 2018-03-18
4 11 2018-03-19
0 22 2018-01-01
1 22 2018-01-02
2 22 2018-01-03
3 22 2018-01-04
4 22 2018-01-05
0 33 2018-02-08
1 33 2018-02-09
2 33 2018-02-10
3 33 2018-02-11
4 33 2018-02-12

Resampling Within a Pandas MultiIndex

I have some hierarchical data which bottoms out into time series data which looks something like this:
df = pandas.DataFrame(
    {'value_a': values_a, 'value_b': values_b},
    index=[states, cities, dates])
df.index.names = ['State', 'City', 'Date']
df
df
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 0 10
2012-01-02 1 11
2012-01-03 2 12
2012-01-04 3 13
Savanna 2012-01-01 4 14
2012-01-02 5 15
2012-01-03 6 16
2012-01-04 7 17
Alabama Mobile 2012-01-01 8 18
2012-01-02 9 19
2012-01-03 10 20
2012-01-04 11 21
Montgomery 2012-01-01 12 22
2012-01-02 13 23
2012-01-03 14 24
2012-01-04 15 25
I'd like to perform time resampling per city, so something like
df.resample("2D", how="sum")
would output
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
as is, df.resample('2D', how='sum') gets me
TypeError: Only valid with DatetimeIndex or PeriodIndex
Fair enough, but I'd sort of expect this to work:
>>> df.swaplevel('Date', 'State').resample('2D', how='sum')
TypeError: Only valid with DatetimeIndex or PeriodIndex
at which point I'm really running out of ideas... is there some way stack and unstack might be able to help me?
pd.Grouper
allows you to specify a "groupby instruction for a target object". In
particular, you can use it to group by dates even if df.index is not a DatetimeIndex:
df.groupby(pd.Grouper(freq='2D', level=-1))
The level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex.
Moreover, you can use this in conjunction with other level values from the index:
level_values = df.index.get_level_values
result = (df.groupby([level_values(i) for i in [0, 1]]
                     + [pd.Grouper(freq='2D', level=-1)]).sum())
It looks a bit awkward, but using_Grouper turns out to be much faster than my original
suggestion, using_reset_index:
import numpy as np
import pandas as pd
import datetime as DT

def using_Grouper(df):
    level_values = df.index.get_level_values
    return (df.groupby([level_values(i) for i in [0, 1]]
                       + [pd.Grouper(freq='2D', level=-1)]).sum())

def using_reset_index(df):
    df = df.reset_index(level=[0, 1])
    return df.groupby(['State', 'City']).resample('2D').sum()

def using_stack(df):
    # http://stackoverflow.com/a/15813787/190597
    return (df.unstack(level=[0, 1])
              .resample('2D').sum()
              .stack(level=[2, 1])
              .swaplevel(2, 0))

def make_orig():
    values_a = range(16)
    values_b = range(10, 26)
    states = ['Georgia'] * 8 + ['Alabama'] * 8
    cities = ['Atlanta'] * 4 + ['Savanna'] * 4 + ['Mobile'] * 4 + ['Montgomery'] * 4
    dates = pd.DatetimeIndex([DT.date(2012, 1, 1) + DT.timedelta(days=i) for i in range(4)] * 4)
    df = pd.DataFrame(
        {'value_a': values_a, 'value_b': values_b},
        index=[states, cities, dates])
    df.index.names = ['State', 'City', 'Date']
    return df

def make_df(N):
    dates = pd.date_range('2000-1-1', periods=N)
    states = np.arange(50)
    cities = np.arange(10)
    index = pd.MultiIndex.from_product([states, cities, dates],
                                       names=['State', 'City', 'Date'])
    df = pd.DataFrame(np.random.randint(10, size=(len(index), 2)), index=index,
                      columns=['value_a', 'value_b'])
    return df

df = make_orig()
print(using_Grouper(df))
yields
value_a value_b
State City Date
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
Here is a benchmark comparing using_Grouper, using_reset_index, using_stack on a 5000-row DataFrame:
In [30]: df = make_df(10)
In [34]: len(df)
Out[34]: 5000
In [32]: %timeit using_Grouper(df)
100 loops, best of 3: 6.03 ms per loop
In [33]: %timeit using_stack(df)
10 loops, best of 3: 22.3 ms per loop
In [31]: %timeit using_reset_index(df)
1 loop, best of 3: 659 ms per loop
You need the groupby() method and provide it with a pd.Grouper for each level of your MultiIndex you wish to maintain in the resulting DataFrame. You can then apply an operation of choice.
To resample date or timestamp levels, you need to set the freq argument with the frequency of choice — a similar approach using pd.TimeGrouper() is deprecated in favour of pd.Grouper() with the freq argument set.
This should give you the DataFrame you need:
df.groupby([pd.Grouper(level='State'),
            pd.Grouper(level='City'),
            pd.Grouper(level='Date', freq='2D')]
          ).sum()
The Time Series Guide in the pandas documentation describes resample() as:
... a time-based groupby, followed by a reduction method on each of its groups.
Hence, using groupby() should technically be the same operation as using .resample() on a DataFrame with a single index.
The same paragraph points to the cookbook section on resampling for more advanced examples, where the 'Grouping using a MultiIndex' entry is highly relevant for this question. Hope that helps.
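As a quick check of that equivalence on a single-index frame (a sketch with synthetic data, not from the original answer):

import numpy as np
import pandas as pd

s = pd.DataFrame({'v': np.arange(6)},
                 index=pd.date_range('2012-01-01', periods=6))
# resample and the equivalent Grouper-based groupby give the same result
assert s.resample('2D').sum().equals(s.groupby(pd.Grouper(freq='2D')).sum())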
An alternative using stack/unstack
df.unstack(level=[0, 1]).resample('2D').sum().stack(level=[2, 1]).swaplevel(2, 0)
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 1 21
Alabama Mobile 2012-01-01 17 37
Montgomery 2012-01-01 25 45
Georgia Savanna 2012-01-01 9 29
Atlanta 2012-01-03 5 25
Alabama Mobile 2012-01-03 21 41
Montgomery 2012-01-03 29 49
Georgia Savanna 2012-01-03 13 33
Notes:
No idea about performance comparison
Possible pandas bug - stack(level=[2,1]) worked, but stack(level=[1,2]) failed
This works:
df.groupby(level=[0, 1]).apply(lambda x: x.set_index('Date').resample('2D').sum())
value_a value_b
State City Date
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
If the Date column is strings, then convert to datetime beforehand:
df['Date'] = pd.to_datetime(df['Date'])
I had the same issue and was breaking my head over it for a while, but then I read the documentation of the .resample function in the 0.19.2 docs, and saw there is a new kwarg called "level" that you can use to specify a level in a MultiIndex.
Edit: More details in the "What's New" section.
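A sketch of that keyword against the example frame in this question (one caveat to verify on your version: resampling on a level aggregates across the remaining index levels, so by itself this is not a per-city resample):

df.resample('2D', level='Date').sum()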
I know this question is a few years old, but I had the same problem and came to a simpler solution that requires 1 line:
>>> import pandas as pd
>>> ts = pd.read_pickle('time_series.pickle')
>>> ts
xxxxxx1 yyyyyyyyyyyyyyyyyyyyyy1 2012-07-01 1
2012-07-02 13
2012-07-03 1
2012-07-04 1
2012-07-05 10
2012-07-06 4
2012-07-07 47
2012-07-08 0
2012-07-09 3
2012-07-10 22
2012-07-11 3
2012-07-12 0
2012-07-13 22
2012-07-14 1
2012-07-15 2
2012-07-16 2
2012-07-17 8
2012-07-18 0
2012-07-19 1
2012-07-20 10
2012-07-21 0
2012-07-22 3
2012-07-23 0
2012-07-24 35
2012-07-25 6
2012-07-26 1
2012-07-27 0
2012-07-28 6
2012-07-29 23
2012-07-30 0
..
xxxxxxN yyyyyyyyyyyyyyyyyyyyyyN 2014-06-02 0
2014-06-03 1
2014-06-04 0
2014-06-05 0
2014-06-06 0
2014-06-07 0
2014-06-08 2
2014-06-09 0
2014-06-10 0
2014-06-11 0
2014-06-12 0
2014-06-13 0
2014-06-14 0
2014-06-15 0
2014-06-16 0
2014-06-17 0
2014-06-18 0
2014-06-19 0
2014-06-20 0
2014-06-21 0
2014-06-22 0
2014-06-23 0
2014-06-24 0
2014-06-25 4
2014-06-26 0
2014-06-27 1
2014-06-28 0
2014-06-29 0
2014-06-30 1
2014-07-01 0
dtype: int64
>>> ts.unstack().T.resample('W', how='sum').T.stack()
xxxxxx1 yyyyyyyyyyyyyyyyyyyyyy1 2012-06-25/2012-07-01 1
2012-07-02/2012-07-08 76
2012-07-09/2012-07-15 53
2012-07-16/2012-07-22 24
2012-07-23/2012-07-29 71
2012-07-30/2012-08-05 38
2012-08-06/2012-08-12 258
2012-08-13/2012-08-19 144
2012-08-20/2012-08-26 184
2012-08-27/2012-09-02 323
2012-09-03/2012-09-09 198
2012-09-10/2012-09-16 348
2012-09-17/2012-09-23 404
2012-09-24/2012-09-30 380
2012-10-01/2012-10-07 367
2012-10-08/2012-10-14 163
2012-10-15/2012-10-21 338
2012-10-22/2012-10-28 252
2012-10-29/2012-11-04 197
2012-11-05/2012-11-11 336
2012-11-12/2012-11-18 234
2012-11-19/2012-11-25 143
2012-11-26/2012-12-02 204
2012-12-03/2012-12-09 296
2012-12-10/2012-12-16 146
2012-12-17/2012-12-23 85
2012-12-24/2012-12-30 198
2012-12-31/2013-01-06 214
2013-01-07/2013-01-13 229
2013-01-14/2013-01-20 192
...
xxxxxxN yyyyyyyyyyyyyyyyyyyyyyN 2013-12-09/2013-12-15 3
2013-12-16/2013-12-22 0
2013-12-23/2013-12-29 0
2013-12-30/2014-01-05 1
2014-01-06/2014-01-12 3
2014-01-13/2014-01-19 6
2014-01-20/2014-01-26 11
2014-01-27/2014-02-02 0
2014-02-03/2014-02-09 1
2014-02-10/2014-02-16 4
2014-02-17/2014-02-23 3
2014-02-24/2014-03-02 1
2014-03-03/2014-03-09 4
2014-03-10/2014-03-16 0
2014-03-17/2014-03-23 0
2014-03-24/2014-03-30 9
2014-03-31/2014-04-06 1
2014-04-07/2014-04-13 1
2014-04-14/2014-04-20 1
2014-04-21/2014-04-27 2
2014-04-28/2014-05-04 8
2014-05-05/2014-05-11 7
2014-05-12/2014-05-18 5
2014-05-19/2014-05-25 2
2014-05-26/2014-06-01 8
2014-06-02/2014-06-08 3
2014-06-09/2014-06-15 0
2014-06-16/2014-06-22 0
2014-06-23/2014-06-29 5
2014-06-30/2014-07-06 1
dtype: int64
ts.unstack().T.resample('W', how='sum').T.stack() is all it took! Very easy and seems quite performant. The pickle I'm reading in is 331M, so this is a pretty beefy data structure; the resampling takes just a couple seconds on my MacBook Pro.
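A note for current pandas: the how keyword has since been removed from .resample, so the modern spelling of that line would be the following (with a datetime index the labels will be week-ending timestamps rather than the period spans shown above):

ts.unstack().T.resample('W').sum().T.stack()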
I haven't checked the efficiency of this, but my instinctive way of performing datetime operations on a MultiIndex was a kind of manual "split-apply-combine" process using a dictionary comprehension.
Assuming your DataFrame is unindexed (you can do .reset_index() first), this works as follows:
Group by the non-date columns
Set "Date" as index and resample each chunk
Reassemble using pd.concat
The final code looks like:
pd.concat({g: x.set_index("Date").resample("2D").mean()
           for g, x in df.groupby(["State", "City"])})
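The outer levels of the resulting index come from the dictionary keys and are unnamed; one way to restore the level names is rename_axis (a sketch building on the line above):

(pd.concat({g: x.set_index("Date").resample("2D").mean()
            for g, x in df.groupby(["State", "City"])})
   .rename_axis(["State", "City", "Date"]))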
I have tried this on my own; it's pretty short and pretty simple too (I will only work with 2 index levels, but you will get the full idea):
Step 1: resample the date, but that would give you the date without the other index:
new = df.reset_index('City').groupby('City', group_keys=False).resample('2D').sum().pad()
That would give you the resampled dates and their aggregated values.
Step 2: get the categorical index in the same order as the date:
col = df.reset_index('City').groupby('City', group_keys=False).resample('2D').pad()[['City']]
That would give you a new column with the city names, in the same order as the dates.
Step 3: merge the dataframes together:
new_df = pd.concat([new, col], axis=1)
It's pretty simple, and you can make it even shorter.
