Correct indexing of time intervals - python

I'm new to pandas and am trying to do an aggregation. I converted the DataFrame's date column to datetime and added a counter that changes with every new day.
model['time_only'] = [time.time() for time in model['date']]
model['date_only'] = [date.date() for date in model['date']]
model['cumsum'] = ((model['date_only'].diff() == datetime.timedelta(days=1))*1).cumsum()
def get_out_of_market_data(data):
    df = data.copy()
    start_market_time = datetime.time(hour=13, minute=30)
    end_market_time = datetime.time(hour=20, minute=0)
    df['time_only'] = [time.time() for time in df['date']]
    df['date_only'] = [date.date() for date in df['date']]
    cond = (start_market_time > df['time_only']) | (df['time_only'] >= end_market_time)
    return data[cond]
model['date'] = pd.to_datetime(model['date'])
new = model.drop(columns=['time_only', 'date_only'])
get_out_of_market_data(data=new).head(20)
What I get:
0 0 65.5000 65.50 65.5000 65.500 DD 1 125 65.500000 2016-01-04 13:15:00 0
26 26 62.7438 62.96 62.6600 62.956 DD 1639 174595 62.781548 2016-01-04 20:00:00 0
27 27 62.5900 62.79 62.5300 62.747 DD 2113 268680 62.650260 2016-01-04 20:15:00 0
28 28 62.7950 62.80 62.5400 62.590 DD 2652 340801 62.652640 2016-01-04 20:30:00 0
29 29 63.1000 63.12 62.7800 62.800 DD 6284 725952 62.963512 2016-01-04 20:45:00 0
30 30 63.2200 63.22 63.0700 63.080 DD 21 699881 63.070114 2016-01-04 21:00:00 0
31 31 63.2200 63.22 63.2200 63.220 DD 7 1973 63.220000 2016-01-04 22:00:00 0
32 32 63.4000 63.40 63.4000 63.400 DD 2 150 63.400000 2016-01-05 00:30:00 1
33 33 62.3700 62.37 62.3700 62.370 DD 3 350 62.370000 2016-01-05 11:00:00 1
34 34 62.1000 62.37 62.1000 62.370 DD 2 300 62.280000 2016-01-05 11:15:00 1
35 35 62.0800 62.08 62.0800 62.080 DD 1 100 62.080000 2016-01-05 11:45:00 1
The last two columns are the timestamp, limited to the out-of-market interval from 20:00 to 13:30, and a counter that increases each time the day changes.
I tried to use groupby on the last column to group the interval from 20:00 one day to 13:00 the next, giving each interval its own index.
I do not fully understand the method, but for example:
new.groupby(pd.Grouper(freq='17hours'))
How can I shift the grouping so that it lines up with this interval?

You could try creating a new column that represents the market day each row belongs to. If the time is earlier than 13:30:00, the row belongs to the previous day's market day; otherwise it belongs to the current day's. Then you can group by that column. The code will be:
def get_market_day(dt):
    if dt.time() < datetime.time(13, 30, 0):
        return dt.date() - datetime.timedelta(days=1)
    else:
        return dt.date()
df["market_day"] = df["dt"].map(get_market_day)
df.groupby("market_day").agg(...)
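The same idea can be sketched without a row-by-row function: subtracting 13 hours 30 minutes from each timestamp and keeping only the date gives the market day directly. This is a minimal sketch under assumptions: the column names date and volume and the sample values are illustrative, not taken from the asker's full data.

```python
import pandas as pd

# Illustrative sample; the column names "date" and "volume" are assumptions.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2016-01-04 13:15:00", "2016-01-04 20:00:00", "2016-01-04 22:00:00",
        "2016-01-05 00:30:00", "2016-01-05 11:45:00", "2016-01-05 20:15:00",
    ]),
    "volume": [125, 174595, 1973, 150, 100, 500],
})

# Shifting back by the 13:30 cutoff moves everything before 13:30
# onto the previous calendar date; rows at or after 13:30 keep their date.
cutoff = pd.Timedelta(hours=13, minutes=30)
df["market_day"] = (df["date"] - cutoff).dt.date

totals = df.groupby("market_day")["volume"].sum()
print(totals)
```

Because the shift is vectorized, it avoids calling a Python function once per row, which may matter on large frames.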

Related

Detrending by date ranges

Considering a df structured like this
Time X
01-01-18 1
01-02-18 20
01-03-18 34
01-04-18 67
01-01-18 89
01-02-18 45
01-03-18 22
01-04-18 1
01-01-19 11
01-02-19 6
01-03-19 78
01-04-19 5
01-01-20 23
01-02-20 6
01-03-20 9
01-04-20 56
01-01-21 78
01-02-21 33
01-03-21 2
01-04-21 67
I want to de-trend the time series from February to April for each year and store the result in a new column Y.
So far I have thought of something like this:
from datetime import date, timedelta
import pandas as pd
df = pd.read_csv(...)
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
df['Y'] = np.nan
def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)
start_date = df.date(2018, 1, 2)
end_date = df.date(2018, 1, 4)
for date in daterange(start_date, end_date):
    df['Y'] = signal.detrend(df['X'])
My concern is that it would iterate over single observations rather than detrend the selected period as a whole. Is there any way to fix it?
Another issue is how to repeat this for every year without changing the start and end dates each time.
When converting strings to the datetime format, you can specify the format directly. infer_datetime_format can mix up day and month.
df['date'] = pd.to_datetime(df['Time'], format='%d-%m-%y')
from scipy.signal import detrend
IIUC, here are two ways to achieve what you want:
1. I would prefer this way, using .apply():
def f(df):
    result = df['X'].copy()
    months = df['date'].dt.month
    mask = (months >= 2) & (months <= 4)
    result[mask] = detrend(result[mask])
    return result
df['new'] = df.groupby(df['date'].dt.year, group_keys=False).apply(f)
2. Another way, using .transform():
ser = df['X'].copy()
ser.index = df['date']
def f(s):
    result = s.copy()
    months = s.index.month
    mask = (months >= 2) & (months <= 4)
    result[mask] = detrend(result[mask])
    return result
new = ser.groupby(ser.index.year).transform(f)
new.index = df.index
df['new'] = new
Result:
date X new
0 2018-01-01 1 1.000000
1 2018-02-01 20 -22.428571
2 2018-03-01 34 -4.057143
3 2018-04-01 67 33.314286
4 2018-01-01 89 89.000000
5 2018-02-01 45 15.685714
6 2018-03-01 22 -2.942857
7 2018-04-01 1 -19.571429
8 2019-01-01 11 11.000000
9 2019-02-01 6 -24.166667
10 2019-03-01 78 48.333333
11 2019-04-01 5 -24.166667
12 2020-01-01 23 23.000000
13 2020-02-01 6 7.333333
14 2020-03-01 9 -14.666667
15 2020-04-01 56 7.333333
16 2021-01-01 78 78.000000
17 2021-02-01 33 16.000000
18 2021-03-01 2 -32.000000
19 2021-04-01 67 16.000000
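As a cross-check on the result above, here is a minimal sketch that reproduces the 2019 and 2020 rows without SciPy. It swaps scipy.signal.detrend for an explicit least-squares line fitted with np.polyfit, which should be equivalent to a linear detrend; the helper name linear_detrend is hypothetical.

```python
import numpy as np
import pandas as pd

def linear_detrend(s):
    """Subtract the least-squares straight line from a Series (hypothetical helper)."""
    x = np.arange(len(s))
    slope, intercept = np.polyfit(x, s.to_numpy(dtype=float), deg=1)
    return s - (slope * x + intercept)

df = pd.DataFrame({
    "Time": ["01-01-19", "01-02-19", "01-03-19", "01-04-19",
             "01-01-20", "01-02-20", "01-03-20", "01-04-20"],
    "X": [11, 6, 78, 5, 23, 6, 9, 56],
})
df["date"] = pd.to_datetime(df["Time"], format="%d-%m-%y")

# Only February to April is detrended; other months keep their original value.
in_window = df["date"].dt.month.between(2, 4)
sub = df.loc[in_window]

df["Y"] = df["X"].astype(float)
df.loc[in_window, "Y"] = sub.groupby(sub["date"].dt.year)["X"].transform(linear_detrend)
print(df[["date", "X", "Y"]])
```

Grouping by year inside the February to April window means each year gets its own fitted line, so no start or end dates have to be edited per year.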

melt columns and add 20 minutes to each row in date column

I'm trying to take this DataFrame (with 1 row in this example):
id Date value_now value+20min value+60min value+80min
0 2015-01-11 00:00:01 12 15 18 22
and to transform it to this:
id Date Value
0 2015-01-11 00:00:01 12
0 2015-01-11 00:20:01 15
0 2015-01-11 00:40:01 18
0 2015-01-11 01:00:01 22
As you can see, I need to turn the value columns into rows and shift the date according to each column name. I understand I can do this with melt, but I'm having a hard time doing it.
Please help me with that.
Thank you!
You can melt the DataFrame, split the variable column on +, take the right-hand side of the split, convert it to a timedelta, and add it back to the date:
final = df.melt(['id','Date'])
final['Date'] += pd.to_timedelta(final['variable'].str.split('+').str[1].fillna('0min'))
print(final.drop(columns='variable'))
id Date value
0 0 2015-01-11 00:00:01 12
1 0 2015-01-11 00:20:01 15
2 0 2015-01-11 00:40:01 18
3 0 2015-01-11 01:20:01 22
Another way, proposed by @YOBEN_S, is to extract the number from the variable column, convert it to a timedelta, and add it to Date with df.assign:
final1 = (df.melt(['id','Date']).assign(Date=lambda x:
          x['Date'] + pd.to_timedelta(x['variable'].str.findall(r'\d+')
          .str[0].fillna(0).astype(float), unit='min')))
Here's one approach:
out = df.melt(id_vars=['id', 'Date'])
minutes = pd.to_numeric(out.variable.str.rsplit('+', n=1).str[-1]
                        .str.rstrip('min'),
                        errors='coerce')
out['Date'] = pd.to_datetime(out.Date)
out['Date'] = out.Date + pd.to_timedelta(minutes.fillna(0), unit='m')
print(out.drop(columns='variable'))
id Date value
0 2015-01-11 2020-02-14 00:00:01 12
1 2015-01-11 2020-02-14 00:20:01 15
2 2015-01-11 2020-02-14 00:40:01 18
3 2015-01-11 2020-02-14 01:20:01 22
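For reference, a self-contained sketch of the melt-then-shift idea, assuming the column headers exactly as shown in the question (value_now, value+20min, value+60min, value+80min) and a Date column that is already a full datetime. The regular expression is an illustrative choice, not part of the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [0],
    "Date": pd.to_datetime(["2015-01-11 00:00:01"]),
    "value_now": [12],
    "value+20min": [15],
    "value+60min": [18],
    "value+80min": [22],
})

# One row per original value column.
long = df.melt(id_vars=["id", "Date"], var_name="variable", value_name="Value")

# Pull the minute offset out of names like "value+20min"; "value_now" has none, so it becomes 0.
minutes = long["variable"].str.extract(r"\+(\d+)min$")[0].fillna("0").astype(int)

long["Date"] = long["Date"] + pd.to_timedelta(minutes, unit="min")
long = long.drop(columns="variable")
print(long)
```

The offsets here follow the column suffixes literally, so the shifted times are 20, 60 and 80 minutes after the original timestamp.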

How to continue the week number when the year changes using pandas

Example: using
df['Week_Number'] = df['Date'].dt.strftime('%U')
29/12/2019 falls in week 52, and that week runs from 29/12/2019 to 04/01/2020.
But for 01/01/2020 the week comes out as 00.
I need the week for 01/01/2020 to be 52 as well, and 05/01/2020 to 11/01/2020 to be 53, continuing in the same way from there.
I used the following logic to solve this.
First, let's write a function that creates a DataFrame of dates from 2019-12-01 to 2020-01-31:
def create_date_table(start='2019-12-01', end='2020-01-31'):
    df = pd.DataFrame({"Date": pd.date_range(start, end)})
    df["Week_start_from_Monday"] = df.Date.dt.isocalendar().week
    df['Week_start_from_Sunday'] = df['Date'].dt.strftime('%U')
    return df
Run the function and observe the Dataframe
date_df=create_date_table()
date_df.head(n=40)
There are two week fields in the DataFrame, Week_start_from_Monday and Week_start_from_Sunday; the difference comes from whether Monday or Sunday is counted as the first day of the week.
In this case, Week_start_from_Sunday is the one we need to focus on.
Now we write a function that adds a column with week numbers that continue from the previous year instead of resetting to 00 when a new year starts.
from datetime import datetime
from pandas import Timestamp
def add_continued_week_field(date: Timestamp, df_start_date: str = '2019-12-01') -> int:
    start_date = datetime.strptime(df_start_date, '%Y-%m-%d')
    year_of_start_date = start_date.year
    year_of_date = date.year
    week_of_date = date.strftime("%U")
    year_diff = year_of_date - year_of_start_date
    if year_diff == 0:
        continued_week = int(week_of_date)
    else:
        # each elapsed year adds 52 weeks to the '%U' week number
        continued_week = year_diff * 52 + int(week_of_date)
    return continued_week
Let's apply the function add_continued_week_field to the dates' Dataframe.
date_df['Week_continue'] = date_df['Date'].apply(add_continued_week_field)
We can see the new added field in the dates' Dataframe
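As stated in "converting a pandas date to week number", you can use df['Date'].dt.week to get week numbers (this accessor was removed in pandas 2.0; df['Date'].dt.isocalendar().week is the replacement).
To let it continue, you could maybe add the last week number of the old year to the new year's week values, something like this? I cannot test this right now...
# Untested, as noted above: '%U' yields strings and the comparison is element-wise,
# so this would need int casts and a boolean mask rather than a plain if.
if(df['Date'].dt.strftime('%U') == 53):
    last = df['Date'].dt.strftime('%U')
    df['Week_Number'] = last + df['Date'].dt.strftime('%U')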
You can do this with isoweek and isoyear.
I don't see how you arrive at the values you present with '%U' so I will assume that you want to map the week starting on Sunday 2019-12-29 ending on 2020-01-04 to 53, and that you want to map the following week to 54 and so on.
For weeks to continue past the year you need isoweek.
isocalendar() provides a tuple with isoweek in the second element and a corresponding unique isoyear in the first element.
But isoweek starts on Monday so we have to add one day so the Sunday is interpreted as Monday and counted to the right week.
2019 is subtracted to have years starting from 0, then every year is multiplied with 53 and the isoweek is added. Finally there is an offset of 1 so you arrive at 53.
In [0]: s=pd.Series(["29/12/2019", "01/01/2020", "05/01/2020", "11/01/2020"])
dts = pd.to_datetime(s, format='%d/%m/%Y')
In [0]: (dts + pd.DateOffset(days=1)).apply(lambda x: (x.isocalendar()[0] -2019)*53 + x.isocalendar()[1] -1)
Out[0]:
0 53
1 53
2 54
3 54
dtype: int64
This of course assumes that every ISO year has 53 weeks, which is not the case, so instead you would want to compute the number of ISO weeks in each ISO year since 2019 and sum those up.
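One way to sidestep the weeks-per-year assumption altogether is to count whole weeks from a fixed anchor date. The sketch below is an illustration, not the answerer's method: the anchor Sunday 2019-12-29 and its label 52 are assumptions chosen to match the numbering requested in the question.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime([
    "2019-12-29", "2020-01-01", "2020-01-04",
    "2020-01-05", "2020-01-11", "2020-01-12",
]))

# Assumed anchor: a Sunday whose week should be labelled 52.
anchor = pd.Timestamp("2019-12-29")
anchor_week = 52

# Whole weeks elapsed since the anchor Sunday, offset by the anchor's label.
week_number = (dates - anchor).dt.days // 7 + anchor_week
print(week_number.tolist())
```

Since this only counts elapsed days, the numbering keeps increasing across any number of year boundaries, and dates before the anchor simply receive smaller numbers.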
Maybe you are looking for this. I fixed an epoch. If you have dates earlier than 2019, you can choose another epoch.
import numpy as np
import pandas as pd
epoch = pd.Timestamp("2019-12-23")
# Test data:
df = pd.DataFrame({"Date": pd.date_range("2019-12-22", freq="1D", periods=25)})
df["Day_name"] = df.Date.dt.day_name()
# Calculation: ISO week up to the epoch, then whole weeks elapsed since the epoch
df["Week_Number"] = np.where(df.Date.le(epoch),
                             df.Date.dt.isocalendar().week.astype(int),
                             df.Date.sub(epoch).dt.days // 7 + 52)
df
Date Day_name Week_Number
0 2019-12-22 Sunday 51
1 2019-12-23 Monday 52
2 2019-12-24 Tuesday 52
3 2019-12-25 Wednesday 52
4 2019-12-26 Thursday 52
5 2019-12-27 Friday 52
6 2019-12-28 Saturday 52
7 2019-12-29 Sunday 52
8 2019-12-30 Monday 53
9 2019-12-31 Tuesday 53
10 2020-01-01 Wednesday 53
11 2020-01-02 Thursday 53
12 2020-01-03 Friday 53
13 2020-01-04 Saturday 53
14 2020-01-05 Sunday 53
15 2020-01-06 Monday 54
16 2020-01-07 Tuesday 54
17 2020-01-08 Wednesday 54
18 2020-01-09 Thursday 54
19 2020-01-10 Friday 54
20 2020-01-11 Saturday 54
21 2020-01-12 Sunday 54
22 2020-01-13 Monday 55
23 2020-01-14 Tuesday 55
24 2020-01-15 Wednesday 55
I got here wanting to know how to label consecutive weeks - I'm not sure if that's exactly what the question is asking but I think it might be. So here is what I came up with:
# Create dataframe with example dates
# It has a datetime index and a column with day of week (just to check that it's working)
dates = pd.date_range('2019-12-15','2020-01-10')
df = pd.DataFrame(dates.dayofweek,index=dates,columns=['dow'])
# Add column
# THESE ARE THE RELEVANT LINES
woy = df.index.isocalendar().week.to_numpy(dtype=int)  # weekofyear was removed in pandas 2.0
numbered = np.cumsum(np.diff(woy,prepend=woy[0])!=0)
# Append for easier comparison
df['week_num'] = numbered
df then looks like this:
dow week_num
2019-12-15 6 0
2019-12-16 0 1
2019-12-17 1 1
2019-12-18 2 1
2019-12-19 3 1
2019-12-20 4 1
2019-12-21 5 1
2019-12-22 6 1
2019-12-23 0 2
2019-12-24 1 2
2019-12-25 2 2
2019-12-26 3 2
2019-12-27 4 2
2019-12-28 5 2
2019-12-29 6 2
2019-12-30 0 3
2019-12-31 1 3
2020-01-01 2 3
2020-01-02 3 3
2020-01-03 4 3
2020-01-04 5 3
2020-01-05 6 3
2020-01-06 0 4
2020-01-07 1 4
2020-01-08 2 4
2020-01-09 3 4
2020-01-10 4 4
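A possible alternative for the same labelling, sketched here as an illustration rather than as the answer's method, is to convert the dates to weekly periods and number the distinct periods in order of appearance. The "W" frequency in pandas denotes weeks ending on Sunday, so weeks start on Monday as in the table above.

```python
import pandas as pd

dates = pd.date_range("2019-12-15", "2020-01-10")

# Each date maps to its Monday-to-Sunday week; factorize numbers the
# distinct weeks 0, 1, 2, ... in the order they first appear.
codes, weeks = pd.factorize(dates.to_period("W"))

df = pd.DataFrame({"dow": dates.dayofweek, "week_num": codes}, index=dates)
print(df)
```

Because the periods carry their year, this labelling does not depend on week-of-year values and so is unaffected by the reset at the year boundary.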

Subtract date from datetime columns

I have a DataFrame where one of the columns ('ProcessingDATE') is in datetime format. I want to create another column ('Report Date') where, if the processing date is a Monday, 3 days are subtracted from it, which lands on a Friday; otherwise 1 day is subtracted.
I've only been using Python for a short time, so I don't have much idea how to write it. My thought was to write a for loop: if the cell is a Monday, then = datetime.datetime.today() - datetime.timedelta(days=3); else = datetime.datetime.today() - datetime.timedelta(days=1)
for j in range(len(DDA_compamy['ProcessingDATE'])):
if pd.to_datetime(datetime(DDA_company.ProcessingDATE[j])).weekday() == 2
Hope this helps. Monday is weekday 0 in pandas, and the check has to be applied element-wise through the .dt accessor:
import numpy as np
import pandas as pd
is_monday = DDA_compamy['ProcessingDATE'].dt.weekday == 0  # True where the processing date is a Monday
days_back = np.where(is_monday, 3, 1)  # 3 days back from a Monday (to Friday), otherwise 1 day
DDA_compamy['Report Date'] = DDA_compamy['ProcessingDATE'] - pd.to_timedelta(days_back, unit='D')
The above can also be written as a single statement:
DDA_compamy['Report Date'] = DDA_compamy['ProcessingDATE'] - pd.to_timedelta(np.where(DDA_compamy['ProcessingDATE'].dt.weekday == 0, 3, 1), unit='D')
Use pandas.Series.dt.weekday and some logic:
import pandas as pd
df = pd.DataFrame({'ProcessingDATE':pd.date_range('2019-04-01', '2019-04-27')})
df1 = df.copy()
mask = df1['ProcessingDATE'].dt.weekday == 0
df.loc[mask, 'ProcessingDATE'] = df1['ProcessingDATE'] - pd.to_timedelta('3 days')
df.loc[~mask, 'ProcessingDATE'] = df1['ProcessingDATE'] - pd.to_timedelta('1 days')
Output:
ProcessingDATE
0 2019-03-29
1 2019-04-01
2 2019-04-02
3 2019-04-03
4 2019-04-04
5 2019-04-05
6 2019-04-06
7 2019-04-05
8 2019-04-08
9 2019-04-09
10 2019-04-10
11 2019-04-11
12 2019-04-12
13 2019-04-13
14 2019-04-12
15 2019-04-15
16 2019-04-16
17 2019-04-17
18 2019-04-18
19 2019-04-19
20 2019-04-20
21 2019-04-19
22 2019-04-22
23 2019-04-23
24 2019-04-24
25 2019-04-25
26 2019-04-26
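The output above overwrites ProcessingDATE in place. If the original column should be kept and the result written to a separate 'Report Date' column, as the question asks, a minimal sketch could look like this; the four sample dates are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ProcessingDATE": pd.to_datetime(
        ["2019-04-01", "2019-04-02", "2019-04-05", "2019-04-08"]
    )
})

# Monday is weekday 0: step back 3 days (to Friday), otherwise 1 day.
is_monday = df["ProcessingDATE"].dt.weekday == 0
days_back = np.where(is_monday, 3, 1)

df["Report Date"] = df["ProcessingDATE"] - pd.to_timedelta(days_back, unit="D")
print(df)
```

Building the per-row offset first keeps the subtraction a single vectorized operation, with no loop over rows.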

Resampling Within a Pandas MultiIndex

I have some hierarchical data which bottoms out into time series data which looks something like this:
df = pandas.DataFrame(
{'value_a': values_a, 'value_b': values_b},
index=[states, cities, dates])
df.index.names = ['State', 'City', 'Date']
df
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 0 10
2012-01-02 1 11
2012-01-03 2 12
2012-01-04 3 13
Savanna 2012-01-01 4 14
2012-01-02 5 15
2012-01-03 6 16
2012-01-04 7 17
Alabama Mobile 2012-01-01 8 18
2012-01-02 9 19
2012-01-03 10 20
2012-01-04 11 21
Montgomery 2012-01-01 12 22
2012-01-02 13 23
2012-01-03 14 24
2012-01-04 15 25
I'd like to perform time resampling per city, so something like
df.resample("2D", how="sum")
would output
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
As is, df.resample('2D', how='sum') gets me
TypeError: Only valid with DatetimeIndex or PeriodIndex
Fair enough, but I'd sort of expect this to work:
>>> df.swaplevel('Date', 'State').resample('2D', how='sum')
TypeError: Only valid with DatetimeIndex or PeriodIndex
at which point I'm really running out of ideas... is there some way stack and unstack might be able to help me?
pd.Grouper allows you to specify a "groupby instruction for a target object". In particular, you can use it to group by dates even if df.index is not a DatetimeIndex:
df.groupby(pd.Grouper(freq='2D', level=-1))
The level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex.
Moreover, you can use this in conjunction with other level values from the index:
level_values = df.index.get_level_values
result = (df.groupby([level_values(i) for i in [0,1]]
                     + [pd.Grouper(freq='2D', level=-1)]).sum())
It looks a bit awkward, but using_Grouper turns out to be much faster than my original suggestion, using_reset_index:
import numpy as np
import pandas as pd
import datetime as DT
def using_Grouper(df):
    level_values = df.index.get_level_values
    return (df.groupby([level_values(i) for i in [0,1]]
                       + [pd.Grouper(freq='2D', level=-1)]).sum())
def using_reset_index(df):
    df = df.reset_index(level=[0, 1])
    return df.groupby(['State','City']).resample('2D').sum()
def using_stack(df):
    # http://stackoverflow.com/a/15813787/190597
    return (df.unstack(level=[0,1])
              .resample('2D').sum()
              .stack(level=[2,1])
              .swaplevel(2,0))
def make_orig():
    values_a = range(16)
    values_b = range(10, 26)
    states = ['Georgia']*8 + ['Alabama']*8
    cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
    dates = pd.DatetimeIndex([DT.date(2012,1,1)+DT.timedelta(days = i) for i in range(4)]*4)
    df = pd.DataFrame(
        {'value_a': values_a, 'value_b': values_b},
        index = [states, cities, dates])
    df.index.names = ['State', 'City', 'Date']
    return df
def make_df(N):
    dates = pd.date_range('2000-1-1', periods=N)
    states = np.arange(50)
    cities = np.arange(10)
    index = pd.MultiIndex.from_product([states, cities, dates],
                                       names=['State', 'City', 'Date'])
    df = pd.DataFrame(np.random.randint(10, size=(len(index),2)), index=index,
                      columns=['value_a', 'value_b'])
    return df
df = make_orig()
print(using_Grouper(df))
yields
value_a value_b
State City Date
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
Here is a benchmark comparing using_Grouper, using_reset_index, using_stack on a 5000-row DataFrame:
In [30]: df = make_df(10)
In [34]: len(df)
Out[34]: 5000
In [32]: %timeit using_Grouper(df)
100 loops, best of 3: 6.03 ms per loop
In [33]: %timeit using_stack(df)
10 loops, best of 3: 22.3 ms per loop
In [31]: %timeit using_reset_index(df)
1 loop, best of 3: 659 ms per loop
You need the groupby() method, providing it with a pd.Grouper for each level of your MultiIndex that you wish to keep in the resulting DataFrame. You can then apply an operation of choice.
To resample date or timestamp levels, set the freq argument to the frequency of choice. A similar approach using pd.TimeGrouper() is deprecated in favour of pd.Grouper() with the freq argument set.
This should give you the DataFrame you need:
df.groupby([pd.Grouper(level='State'),
pd.Grouper(level='City'),
pd.Grouper(level='Date', freq='2D')]
).sum()
The Time Series Guide in the pandas documentation describes resample() as:
... a time-based groupby, followed by a reduction method on each of its groups.
Hence, using groupby() should technically be the same operation as using .resample() on a DataFrame with a single index.
The same paragraph points to the cookbook section on resampling for more advanced examples, where the 'Grouping using a MultiIndex' entry is highly relevant for this question. Hope that helps.
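To make the Grouper-per-level approach easy to try, here is a self-contained sketch on a cut-down version of the question's data (only the two Georgia cities); the construction of the sample frame is an assumption for illustration.

```python
import pandas as pd

dates = pd.date_range("2012-01-01", periods=4)
index = pd.MultiIndex.from_product(
    [["Georgia"], ["Atlanta", "Savanna"], dates],
    names=["State", "City", "Date"],
)
df = pd.DataFrame({"value_a": range(8), "value_b": range(10, 18)}, index=index)

# One Grouper per index level; only the Date level gets a frequency,
# so State and City are kept as they are while dates fall into 2-day bins.
result = df.groupby([
    pd.Grouper(level="State"),
    pd.Grouper(level="City"),
    pd.Grouper(level="Date", freq="2D"),
]).sum()
print(result)
```

Referring to levels by name rather than position keeps the call readable and independent of the level order in the index.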
An alternative using stack/unstack:
df.unstack(level=[0,1]).resample('2D').sum().stack(level=[2,1]).swaplevel(2,0)
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 1 21
Alabama Mobile 2012-01-01 17 37
Montgomery 2012-01-01 25 45
Georgia Savanna 2012-01-01 9 29
Atlanta 2012-01-03 5 25
Alabama Mobile 2012-01-03 21 41
Montgomery 2012-01-03 29 49
Georgia Savanna 2012-01-03 13 33
Notes:
No idea about performance comparison
Possible pandas bug - stack(level=[2,1]) worked, but stack(level=[1,2]) failed
This works:
df.groupby(level=[0,1]).apply(lambda x: x.set_index('Date').resample('2D').sum())
value_a value_b
State City Date
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
If the Date column is strings, then convert to datetime beforehand:
df['Date'] = pd.to_datetime(df['Date'])
I had the same issue and was racking my brain for a while, but then I read the documentation of the .resample function in the 0.19.2 docs and saw that there's a new kwarg called "level" that you can use to specify a level in a MultiIndex.
Edit: More details in the "What's New" section.
I know this question is a few years old, but I had the same problem and came to a simpler solution that requires 1 line:
>>> import pandas as pd
>>> ts = pd.read_pickle('time_series.pickle')
>>> ts
xxxxxx1 yyyyyyyyyyyyyyyyyyyyyy1 2012-07-01 1
2012-07-02 13
2012-07-03 1
2012-07-04 1
2012-07-05 10
2012-07-06 4
2012-07-07 47
2012-07-08 0
2012-07-09 3
2012-07-10 22
2012-07-11 3
2012-07-12 0
2012-07-13 22
2012-07-14 1
2012-07-15 2
2012-07-16 2
2012-07-17 8
2012-07-18 0
2012-07-19 1
2012-07-20 10
2012-07-21 0
2012-07-22 3
2012-07-23 0
2012-07-24 35
2012-07-25 6
2012-07-26 1
2012-07-27 0
2012-07-28 6
2012-07-29 23
2012-07-30 0
..
xxxxxxN yyyyyyyyyyyyyyyyyyyyyyN 2014-06-02 0
2014-06-03 1
2014-06-04 0
2014-06-05 0
2014-06-06 0
2014-06-07 0
2014-06-08 2
2014-06-09 0
2014-06-10 0
2014-06-11 0
2014-06-12 0
2014-06-13 0
2014-06-14 0
2014-06-15 0
2014-06-16 0
2014-06-17 0
2014-06-18 0
2014-06-19 0
2014-06-20 0
2014-06-21 0
2014-06-22 0
2014-06-23 0
2014-06-24 0
2014-06-25 4
2014-06-26 0
2014-06-27 1
2014-06-28 0
2014-06-29 0
2014-06-30 1
2014-07-01 0
dtype: int64
>>> ts.unstack().T.resample('W').sum().T.stack()
xxxxxx1 yyyyyyyyyyyyyyyyyyyyyy1 2012-06-25/2012-07-01 1
2012-07-02/2012-07-08 76
2012-07-09/2012-07-15 53
2012-07-16/2012-07-22 24
2012-07-23/2012-07-29 71
2012-07-30/2012-08-05 38
2012-08-06/2012-08-12 258
2012-08-13/2012-08-19 144
2012-08-20/2012-08-26 184
2012-08-27/2012-09-02 323
2012-09-03/2012-09-09 198
2012-09-10/2012-09-16 348
2012-09-17/2012-09-23 404
2012-09-24/2012-09-30 380
2012-10-01/2012-10-07 367
2012-10-08/2012-10-14 163
2012-10-15/2012-10-21 338
2012-10-22/2012-10-28 252
2012-10-29/2012-11-04 197
2012-11-05/2012-11-11 336
2012-11-12/2012-11-18 234
2012-11-19/2012-11-25 143
2012-11-26/2012-12-02 204
2012-12-03/2012-12-09 296
2012-12-10/2012-12-16 146
2012-12-17/2012-12-23 85
2012-12-24/2012-12-30 198
2012-12-31/2013-01-06 214
2013-01-07/2013-01-13 229
2013-01-14/2013-01-20 192
...
xxxxxxN yyyyyyyyyyyyyyyyyyyyyyN 2013-12-09/2013-12-15 3
2013-12-16/2013-12-22 0
2013-12-23/2013-12-29 0
2013-12-30/2014-01-05 1
2014-01-06/2014-01-12 3
2014-01-13/2014-01-19 6
2014-01-20/2014-01-26 11
2014-01-27/2014-02-02 0
2014-02-03/2014-02-09 1
2014-02-10/2014-02-16 4
2014-02-17/2014-02-23 3
2014-02-24/2014-03-02 1
2014-03-03/2014-03-09 4
2014-03-10/2014-03-16 0
2014-03-17/2014-03-23 0
2014-03-24/2014-03-30 9
2014-03-31/2014-04-06 1
2014-04-07/2014-04-13 1
2014-04-14/2014-04-20 1
2014-04-21/2014-04-27 2
2014-04-28/2014-05-04 8
2014-05-05/2014-05-11 7
2014-05-12/2014-05-18 5
2014-05-19/2014-05-25 2
2014-05-26/2014-06-01 8
2014-06-02/2014-06-08 3
2014-06-09/2014-06-15 0
2014-06-16/2014-06-22 0
2014-06-23/2014-06-29 5
2014-06-30/2014-07-06 1
dtype: int64
ts.unstack().T.resample('W').sum().T.stack() is all it took! Very easy and seems quite performant. The pickle I'm reading in is 331M, so this is a pretty beefy data structure; the resampling takes just a couple of seconds on my MacBook Pro.
I haven't checked the efficiency of this, but my instinctive way of performing datetime operations on a MultiIndex was a kind of manual "split-apply-combine" process using a dictionary comprehension.
Assuming your DataFrame is unindexed (you can call .reset_index() first), this works as follows:
1. Group by the non-date columns
2. Set "Date" as the index and resample each chunk
3. Reassemble using pd.concat
The final code looks like:
pd.concat({g: x.set_index("Date").resample("2D").mean()
for g, x in house.groupby(["State", "City"])})
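A runnable sketch of the same split-apply-combine steps on a small frame follows. The sample data is a cut-down version of the question's table, and it uses sum() instead of mean() so the numbers can be compared with the question's expected output; both choices are assumptions for illustration.

```python
import pandas as pd

dates = pd.date_range("2012-01-01", periods=4)
index = pd.MultiIndex.from_product(
    [["Georgia"], ["Atlanta", "Savanna"], dates],
    names=["State", "City", "Date"],
)
df = pd.DataFrame({"value_a": range(8), "value_b": range(10, 18)}, index=index)

# Step 0: flatten the index so State, City and Date are ordinary columns.
flat = df.reset_index()

# Split by (State, City), resample each chunk on its own DatetimeIndex,
# then reassemble; the dict keys become the outer index levels.
result = pd.concat({
    key: chunk.set_index("Date")[["value_a", "value_b"]].resample("2D").sum()
    for key, chunk in flat.groupby(["State", "City"])
})
result.index.names = ["State", "City", "Date"]
print(result)
```

Selecting the value columns before resampling keeps the State and City strings out of the aggregation.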
I tried this on my own; it is pretty short and simple too (I will only work with 2 index levels, but you will get the full idea):
Step 1: resample the date, which gives you the date without the other index:
new=df.reset_index('City').groupby('crime', group_keys=False).resample('2d').sum().pad()
That gives you the date and its count.
Step 2: get the categorical index in the same order as the date:
col=df.reset_index('City').groupby('City', group_keys=False).resample('2D').pad()[['City']]
That gives you a new column with the city names, in the same order as the date.
Step 3: merge the DataFrames together:
new_df=pd.concat([new, col], axis=1)
It's pretty simple, and you can make it a lot shorter.
