I have the following dataframe in Pandas. The Score and Date_of_interest columns are to be calculated; below they are already filled in to make the explanation of the problem easier.
First, let's assume the Score and Date_of_interest columns are filled with NaNs only. Below are the steps to fill in their values.
a) We are trying to get one date of interest, based on the criteria described below, for each PC_id, e.g. PC_id 200 has 1998-04-10 02:25:00, and so on.
b) To solve this we walk down the rows within each PC_id and check each one for a change in Item_id; each change scores 1. For the same Item_id, as in the first and second rows (1 and 1), the score starts at 1 but does not change in the second row.
c) While moving on and scoring the second row, we also check the Datetime difference: if the previous row is more than 24 hours older, it is dropped, the score is reset to 1, and the cursor moves to the third row.
d) When the Score reaches 2 we have the qualifying score, as in row 5 (index 4), and we copy the corresponding Datetime into the Date_of_interest column.
e) We start a new cycle for the next PC_id, as in row six.
Datetime Item_id PC_id Value Score Date_of_interest
0 1998-04-08 01:00:00 1 200 35 1 NaN
1 1998-04-08 02:00:00 1 200 92 1 NaN
2 1998-04-10 02:00:00 2 200 35 1 NaN
3 1998-04-10 02:15:00 2 200 92 1 NaN
4 1998-04-10 02:25:00 3 200 92 2 1998-04-10 02:25:00
5 1998-04-10 03:00:00 1 201 93 1 NaN
6 1998-04-12 03:30:00 3 201 94 1 NaN
7 1998-04-12 04:00:00 4 201 95 2 NaN
8 1998-04-12 04:00:00 4 201 26 2 1998-04-12 04:00:00
9 1998-04-12 04:30:00 2 201 98 3 NaN
10 1998-04-12 04:50:00 1 202 100 1 NaN
11 1998-04-15 05:00:00 4 202 100 1 NaN
12 1998-04-15 05:15:00 3 202 100 2 1998-04-15 05:15:00
13 1998-04-15 05:30:00 2 202 100 3 NaN
14 1998-04-15 06:00:00 3 202 100 NaN NaN
15 1998-04-15 06:00:00 3 202 222 NaN NaN
The final table should be as follows:
PC_id Date_of_interest
0 200 1998-04-10 02:25:00
1 201 1998-04-12 04:00:00
2 202 1998-04-15 05:15:00
Thanks for helping.
Update: the code I am currently working on:
df_merged_unique = df_merged['PC_id'].unique()
score = 0
for i, row in df_merged.iterrows():
    for elem in df_merged_unique:
        first_date = row['Datetime']
        first_item = 0
        if row['PC_id'] == elem:
            if row['Score'] < 2:
                if row['Item_id'] != first_item:
                    if row['Datetime'] - first_date <= pd.Timedelta(days=1):
                        score += 1
                        row['Score'] = score
                        first_date = row['Datetime']
                    else:
                        pass
                else:
                    pass
            else:
                row['Date_of_interest'] = row['Datetime']
                break
        else:
            pass
Usually having to resort to iterative/imperative methods is a sign of trouble when working with pandas. Given the dataframe
In [111]: df2
Out[111]:
Datetime Item_id PC_id Value
0 1998-04-08 01:00:00 1 200 35
1 1998-04-08 02:00:00 1 200 92
2 1998-04-10 02:00:00 2 200 35
3 1998-04-10 02:15:00 2 200 92
4 1998-04-10 02:25:00 3 200 92
5 1998-04-10 03:00:00 1 201 93
6 1998-04-12 03:30:00 3 201 94
7 1998-04-12 04:00:00 4 201 95
8 1998-04-12 04:00:00 4 201 26
9 1998-04-12 04:30:00 2 201 98
10 1998-04-12 04:50:00 1 202 100
11 1998-04-15 05:00:00 4 202 100
12 1998-04-15 05:15:00 3 202 100
13 1998-04-15 05:30:00 2 202 100
14 1998-04-15 06:00:00 3 202 100
15 1998-04-15 06:00:00 3 202 222
you could first group by PC_id
In [112]: the_group = df2.groupby('PC_id')
and then apply the search using diff() to flag the rows where Item_id and Datetime change appropriately (timedelta here comes from the standard datetime module)
In [357]: (the_group['Item_id'].diff() != 0) & \
...: (the_group['Datetime'].diff() <= timedelta(days=1))
Out[357]:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 True
8 False
9 True
10 False
11 False
12 True
13 True
14 True
15 False
dtype: bool
and then just take the first date (first match) in each group, if any
In [341]: df2[(the_group['Item_id'].diff() != 0) &
...: (the_group['Datetime'].diff() <= timedelta(days=1))]\
...: .groupby('PC_id').first()['Datetime'].reset_index()
Out[341]:
PC_id Datetime
0 200 1998-04-10 02:25:00
1 201 1998-04-12 04:00:00
2 202 1998-04-15 05:15:00
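Pulled out of the IPython session, the whole pipeline is only a few lines. A minimal sketch, assuming df2['Datetime'] has already been parsed to datetime64:
import pandas as pd
from datetime import timedelta

the_group = df2.groupby('PC_id')
# Rows where Item_id changed and the previous row is at most 24 hours older
qualifies = ((the_group['Item_id'].diff() != 0) &
             (the_group['Datetime'].diff() <= timedelta(days=1)))
# First qualifying date per PC_id, named as in the desired output
result = (df2[qualifies]
          .groupby('PC_id')['Datetime']
          .first()
          .rename('Date_of_interest')
          .reset_index())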
Related
I am working with structured log data formatted as follows (here is a pastebin snippet of mock data for easy tinkering):
import pandas as pd
df = pd.read_csv("https://pastebin.com/raw/qrqTMrGa")
print(df)
id date info_a_cnt info_b_cnt has_err
0 123 2020-01-01 123 32 0
1 123 2020-01-02 2 43 0
2 123 2020-01-03 43 4 1
3 123 2020-01-04 43 4 0
4 123 2020-01-05 43 4 0
5 123 2020-01-06 43 4 0
6 123 2020-01-07 43 4 1
7 123 2020-01-08 43 4 0
8 232 2020-01-04 56 4 0
9 232 2020-01-05 97 1 0
10 232 2020-01-06 23 74 0
11 232 2020-01-07 91 85 1
12 232 2020-01-08 91 85 0
13 232 2020-01-09 91 85 0
14 232 2020-01-10 91 85 1
Variables are pretty straightforward:
id: the id of the observed machine
date: observation date
info_a_cnt: counts of a specific kind of info event
info_b_cnt: same as above for a different event type
has_err: whether or not the machine logged any errors
Now, I'd like to group the dataframe by id to create a variable storing the number of days left before an error event. The desired dataframe should look like:
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 232 2020-01-04 56 4 0 3
8 232 2020-01-05 97 1 0 2
9 232 2020-01-06 23 74 0 1
10 232 2020-01-07 91 85 1 0
11 232 2020-01-08 91 85 0 2
12 232 2020-01-09 91 85 0 1
13 232 2020-01-10 91 85 1 0
I am having a hard time figuring out the correct implementation with the right grouping functions.
Edit:
All the answers below work really well for dates at daily granularity. I am wondering how to adapt @jezrael's solution below to a dataframe containing timestamps (logs will be batched at 15-minute intervals):
df:
df = pd.read_csv("https://pastebin.com/raw/YZukAhBz")
print(df)
id date info_a_cnt info_b_cnt has_err
0 123 2020-01-01 12:00:00 123 32 0
1 123 2020-01-01 12:15:00 2 43 0
2 123 2020-01-01 12:30:00 43 4 1
3 123 2020-01-01 12:45:00 43 4 0
4 123 2020-01-01 13:00:00 43 4 0
5 123 2020-01-01 13:15:00 43 4 0
6 123 2020-01-01 13:30:00 43 4 1
7 123 2020-01-01 13:45:00 43 4 0
8 232 2020-01-04 17:00:00 56 4 0
9 232 2020-01-05 17:15:00 97 1 0
10 232 2020-01-06 17:30:00 23 74 0
11 232 2020-01-07 17:45:00 91 85 1
12 232 2020-01-08 18:00:00 91 85 0
13 232 2020-01-09 18:15:00 91 85 0
14 232 2020-01-10 18:30:00 91 85 1
I am wondering how to adapt @jezrael's answer in order to land on something like:
id date info_a_cnt info_b_cnt has_err mins_to_err
0 123 2020-01-01 12:00:00 123 32 0 30
1 123 2020-01-01 12:15:00 2 43 0 15
2 123 2020-01-01 12:30:00 43 4 1 0
3 123 2020-01-01 12:45:00 43 4 0 45
4 123 2020-01-01 13:00:00 43 4 0 30
5 123 2020-01-01 13:15:00 43 4 0 15
6 123 2020-01-01 13:30:00 43 4 1 0
7 123 2020-01-01 13:45:00 43 4 0 60
8 232 2020-01-04 17:00:00 56 4 0 45
9 232 2020-01-05 17:15:00 97 1 0 30
10 232 2020-01-06 17:30:00 23 74 0 15
11 232 2020-01-07 17:45:00 91 85 1 0
12 232 2020-01-08 18:00:00 91 85 0 30
13 232 2020-01-09 18:15:00 91 85 0 15
14 232 2020-01-10 18:30:00 91 85 1 0
Use GroupBy.cumcount with ascending=False, grouping by column id and a helper Series built with Series.cumsum computed from the back, hence the extra reversed indexing with Series.iloc:
g = df['has_err'].iloc[::-1].cumsum().iloc[::-1]
df['days_to_err'] = df.groupby(['id', g])['has_err'].cumcount(ascending=False)
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 123 2020-01-08 43 4 0 0
8 232 2020-01-04 56 4 0 3
9 232 2020-01-05 97 1 0 2
10 232 2020-01-06 23 74 0 1
11 232 2020-01-07 91 85 1 0
12 232 2020-01-08 91 85 0 2
13 232 2020-01-09 91 85 0 1
14 232 2020-01-10 91 85 1 0
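To see why the reversed cumulative sum makes the right group key, here is a minimal demo on one id's has_err column (values taken from id 123 above):
import pandas as pd

s = pd.Series([0, 0, 1, 0, 0, 0, 1, 0])  # has_err for id 123
g = s.iloc[::-1].cumsum().iloc[::-1]     # cumulative sum computed from the back
print(g.tolist())                        # [2, 2, 2, 1, 1, 1, 1, 0]
Each constant block of g ends exactly at an error row, and the trailing block with no error gets its own key, which is why row 7 above ends up with days_to_err 0.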
EDIT: To count the cumulative sum of date differences, use a custom lambda function with GroupBy.transform:
df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
.transform(lambda x: x.diff().dt.days.cumsum())
.fillna(0)
.to_numpy()[::-1])
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2.0
1 123 2020-01-02 2 43 0 1.0
2 123 2020-01-03 43 4 1 0.0
3 123 2020-01-04 43 4 0 3.0
4 123 2020-01-05 43 4 0 2.0
5 123 2020-01-06 43 4 0 1.0
6 123 2020-01-07 43 4 1 0.0
7 123 2020-01-08 43 4 0 0.0
8 232 2020-01-04 56 4 0 3.0
9 232 2020-01-05 97 1 0 2.0
10 232 2020-01-06 23 74 0 1.0
11 232 2020-01-07 91 85 1 0.0
12 232 2020-01-08 91 85 0 2.0
13 232 2020-01-09 91 85 0 1.0
14 232 2020-01-10 91 85 1 0.0
EDIT1: Use Series.dt.total_seconds and divide by 60:
#some data sample cleaning
df = pd.read_csv("https://pastebin.com/raw/YZukAhBz", parse_dates=['date'])
df['date'] = df['date'].apply(lambda x: x.replace(month=1, day=1))
print(df)
df['mins_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
.transform(lambda x: x.diff().dt.total_seconds().div(60).cumsum())
.fillna(0)
.to_numpy()[::-1])
print(df)
id date info_a_cnt info_b_cnt has_err mins_to_err
0 123 2020-01-01 12:00:00 123 32 0 30.0
1 123 2020-01-01 12:15:00 2 43 0 15.0
2 123 2020-01-01 12:30:00 43 4 1 0.0
3 123 2020-01-01 12:45:00 43 4 0 45.0
4 123 2020-01-01 13:00:00 43 4 0 30.0
5 123 2020-01-01 13:15:00 43 4 0 15.0
6 123 2020-01-01 13:30:00 43 4 1 0.0
7 123 2020-01-01 13:45:00 43 4 0 0.0
8 232 2020-01-01 17:00:00 56 4 0 45.0
9 232 2020-01-01 17:15:00 97 1 0 30.0
10 232 2020-01-01 17:30:00 23 74 0 15.0
11 232 2020-01-01 17:45:00 91 85 1 0.0
12 232 2020-01-01 18:00:00 91 85 0 30.0
13 232 2020-01-01 18:15:00 91 85 0 15.0
14 232 2020-01-01 18:30:00 91 85 1 0.0
Use:
df2 = df[::-1]
df['days_to_err'] = df2.groupby(['id', df2['has_err'].eq(1).cumsum()]).cumcount()
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 123 2020-01-08 43 4 0 0
8 232 2020-01-04 56 4 0 3
9 232 2020-01-05 97 1 0 2
10 232 2020-01-06 23 74 0 1
11 232 2020-01-07 91 85 1 0
12 232 2020-01-08 91 85 0 2
13 232 2020-01-09 91 85 0 1
14 232 2020-01-10 91 85 1 0
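Note that the final assignment aligns on the index, not on row order, so computing cumcount on the reversed df2 and assigning the result back to df works without any re-sorting.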
My Dataframe df3 looks something like this:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50
...
11 11 2018-01-01 00:00:07.523 125.5 120
12 12 2018-01-01 00:00:08.757 125.0 120
13 13 2018-01-04 00:00:14.507 127.0 300
14 14 2018-01-04 00:00:15.743 126.5 300
15 15 2018-01-05 00:00:19.407 125.5 350
I wanted to resample using ffill for every second so that it looks like this:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:06.000 125.00 101
1 2 2018-01-01 00:00:07.000 125.00 101
2 3 2018-01-01 00:00:08.000 125.00 101
3 4 2018-01-02 00:00:09.000 125.00 52
4 5 2018-01-02 00:00:10.000 127.00 52
...
My code:
def resample(df):
    indexing = df[['Timestamp', 'Data']]
    indexing['Timestamp'] = pd.to_datetime(indexing['Timestamp'])
    indexing = indexing.set_index('Timestamp')
    indexing1 = indexing.resample('1S', fill_method='ffill')
    # indexing1 = indexing1.resample('D')
    return indexing1
indexing = resample(df3)
but this raised the error:
ValueError: cannot reindex a non-unique index with a method or limit
I don't quite understand what this error means. @jezrael, in a similar question, suggested using drop_duplicates with groupby. I am not sure what this does to the data, as it seems there are no duplicates in my data. Can someone explain this please? Thanks.
This error is caused because of the following:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
When you resample both of these timestamps to the nearest second they both become 2018-01-01 00:00:06, and pandas doesn't know which value of the data to pick because it has two to select from. Instead, you can use an aggregation function such as last (though mean, max and min may also be suitable) to select one of the values. Then you can apply the forward fill.
Example:
from io import StringIO
import pandas as pd
df = pd.read_table(StringIO(""" Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50"""), sep=r'\s\s+', engine='python')
df['Timestamp'] = pd.to_datetime(df['Timestamp']).dt.round('s')
df.set_index('Timestamp', inplace=True)
df = df.resample('1S').last().ffill()
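On the drop_duplicates part of the question: after rounding, your data does contain duplicates, in the timestamp index. Keeping only the last row per duplicated timestamp is equivalent to the .last() aggregation above; a sketch of that variant:
# Keep the last row for each duplicated (rounded) timestamp, then fill forward
df = df[~df.index.duplicated(keep='last')]
df = df.resample('1S').ffill()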
I tried to ask this question previously, but it was too ambiguous so here goes again. I am new to programming, so I am still learning how to ask questions in a useful way.
In summary, I have a pandas dataframe that resembles "INPUT DATA" that I would like to convert to "DESIRED OUTPUT", as shown below.
Each row contains an ID, a DateTime, and a Value. For each unique ID, the first row corresponds to timepoint 'zero', and each subsequent row contains a value 5 minutes after the previous row, and so on.
I would like to calculate the mean across all IDs for every 'time elapsed' timepoint. For example, in "DESIRED OUTPUT" Time Elapsed=0.0 would have the value 128.3 ((100+105+180)/3); Time Elapsed=5.0 would have the value 150.0 ((150+110+190)/3); Time Elapsed=10.0 would have the value 133.3 ((125+90+185)/3), and so on for Time Elapsed=15, 20, 25, etc.
I'm not sure how to create a new column which has the value for the time elapsed for each ID (e.g. 0.0, 5.0, 10.0 etc). I think that once I know how to do that, then I can use the groupby function to calculate the means for each time elapsed.
INPUT DATA
ID DateTime Value
1 2018-01-01 15:00:00 100
1 2018-01-01 15:05:00 150
1 2018-01-01 15:10:00 125
2 2018-02-02 13:15:00 105
2 2018-02-02 13:20:00 110
2 2018-02-02 13:25:00 90
3 2019-03-03 05:05:00 180
3 2019-03-03 05:10:00 190
3 2019-03-03 05:15:00 185
DESIRED OUTPUT
Time Elapsed Mean Value
0.0 128.3
5.0 150.0
10.0 133.3
Here is one way: use transform with groupby to get the group key 'Time Elapsed', then just group by it and take the mean
df['Time Elapsed']=df.DateTime-df.groupby('ID').DateTime.transform('first')
df.groupby('Time Elapsed').Value.mean()
Out[998]:
Time Elapsed
00:00:00 128.333333
00:05:00 150.000000
00:10:00 133.333333
Name: Value, dtype: float64
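If you prefer the keys as elapsed minutes (floats), as in the desired output, the timedelta key can be converted first, for example:
df.groupby(df['Time Elapsed'].dt.total_seconds().div(60)).Value.mean()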
You can do this explicitly by taking advantage of the datetime attributes of the DateTime column in your DataFrame
First get the year, month and day for each DateTime since they are all changing in your data
df['month'] = df['DateTime'].dt.month
df['day'] = df['DateTime'].dt.day
df['year'] = df['DateTime'].dt.year
print(df)
ID DateTime Value month day year
1 1 2018-01-01 15:00:00 100 1 1 2018
1 1 2018-01-01 15:05:00 150 1 1 2018
1 1 2018-01-01 15:10:00 125 1 1 2018
2 2 2018-02-02 13:15:00 105 2 2 2018
2 2 2018-02-02 13:20:00 110 2 2 2018
2 2 2018-02-02 13:25:00 90 2 2 2018
3 3 2019-03-03 05:05:00 180 3 3 2019
3 3 2019-03-03 05:10:00 190 3 3 2019
3 3 2019-03-03 05:15:00 185 3 3 2019
Then append a sequential DateTime counter column (per this SO post)
the counter is computed within (1) each year, (2) then each month and then (3) each day
since the data are in multiples of 5 minutes, use this to scale the counter values (i.e. the counter will be in multiples of 5 minutes, rather than a sequence of increasing integers)
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount() * 5
print(df)
ID DateTime Value month day year Time Elapsed
1 1 2018-01-01 15:00:00 100 1 1 2018 0
1 1 2018-01-01 15:05:00 150 1 1 2018 5
1 1 2018-01-01 15:10:00 125 1 1 2018 10
2 2 2018-02-02 13:15:00 105 2 2 2018 0
2 2 2018-02-02 13:20:00 110 2 2 2018 5
2 2 2018-02-02 13:25:00 90 2 2 2018 10
3 3 2019-03-03 05:05:00 180 3 3 2019 0
3 3 2019-03-03 05:10:00 190 3 3 2019 5
3 3 2019-03-03 05:15:00 185 3 3 2019 10
Perform the groupby over the newly appended counter column
dfg = df.groupby('Time Elapsed')['Value'].mean()
print(dfg)
Time Elapsed
0     128.333333
5     150.000000
10    133.333333
Name: Value, dtype: float64
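A note on the grouping choice: using ['year', 'month', 'day'] works here because each ID was recorded on a distinct calendar day. If the rows are already ordered within each ID, grouping by ID directly avoids that assumption; a sketch:
df['Time Elapsed'] = df.groupby('ID').cumcount() * 5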
My data looks something like this:
ID1 ID2 Date Values
1 1 2018-01-05 75
1 1 2018-01-06 83
1 1 2018-01-07 17
1 1 2018-01-08 15
1 2 2018-02-01 85
1 2 2018-02-02 98
2 1 2018-02-15 54
2 1 2018-02-16 17
2 1 2018-02-17 83
2 1 2018-02-18 94
2 2 2017-12-18 16
2 2 2017-12-19 84
2 2 2017-12-20 47
2 2 2017-12-21 28
2 2 2017-12-22 38
All the operations must be done within groups of ['ID1', 'ID2'].
What I want to do is upsample the dataframe so that I end up with a sub-dataframe for each 'Date' index which includes all previous dates, including the current one, from its own ['ID1', 'ID2'] group. The resulting dataframe should look like this:
ID1 ID2 DateGroup Date Values
1 1 2018-01-05 2018-01-05 75
1 1 2018-01-06 2018-01-05 75
1 1 2018-01-06 2018-01-06 83
1 1 2018-01-07 2018-01-05 75
1 1 2018-01-07 2018-01-06 83
1 1 2018-01-07 2018-01-07 17
1 1 2018-01-08 2018-01-05 75
1 1 2018-01-08 2018-01-06 83
1 1 2018-01-08 2018-01-07 17
1 1 2018-01-08 2018-01-08 15
1 2 2018-02-01 2018-02-01 85
1 2 2018-02-02 2018-02-01 85
1 2 2018-02-02 2018-02-02 98
2 1 2018-02-15 2018-02-15 54
2 1 2018-02-16 2018-02-15 54
2 1 2018-02-16 2018-02-16 17
2 1 2018-02-17 2018-02-15 54
2 1 2018-02-17 2018-02-16 17
2 1 2018-02-17 2018-02-17 83
2 1 2018-02-18 2018-02-15 54
2 1 2018-02-18 2018-02-16 17
2 1 2018-02-18 2018-02-17 83
2 1 2018-02-18 2018-02-18 94
2 2 2017-12-18 2017-12-18 16
2 2 2017-12-19 2017-12-18 16
2 2 2017-12-19 2017-12-19 84
2 2 2017-12-20 2017-12-18 16
2 2 2017-12-20 2017-12-19 84
2 2 2017-12-20 2017-12-20 47
2 2 2017-12-21 2017-12-18 16
2 2 2017-12-21 2017-12-19 84
2 2 2017-12-21 2017-12-20 47
2 2 2017-12-21 2017-12-21 28
2 2 2017-12-22 2017-12-18 16
2 2 2017-12-22 2017-12-19 84
2 2 2017-12-22 2017-12-20 47
2 2 2017-12-22 2017-12-21 28
2 2 2017-12-22 2017-12-22 38
The dataframe I'm working with is quite big (~20 million rows), thus I would like to avoid iterating through each row.
Is it possible to use a function or combination of pandas functions like resample/apply/reindex to achieve what I need?
Assuming ID1 and ID2 form your original index: reset the index, set Date as the index, resample and forward-fill, then set the index back to ['ID1', 'ID2']:
df = df.reset_index().set_index(['Date']).resample('d').ffill().reset_index().set_index(['ID1','ID2'])
If your 'Date' field is a string, you should convert it to datetime before resampling on that field. You can use the following for that:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
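The resample above gives a daily, forward-filled series, but it does not by itself produce the expanding per-date sub-dataframes in the desired output. One vectorized way to build those is a within-group cross merge filtered to past-or-current dates. A sketch, assuming pandas >= 1.2 for how='cross' (memory grows with the square of each group's length, which is fine when per-group histories are short):
import pandas as pd

def expand(g):
    # Pair every date in the group with all of the group's rows,
    # then keep only rows at or before that date.
    out = g[['Date']].rename(columns={'Date': 'DateGroup'}).merge(g, how='cross')
    return out[out['Date'] <= out['DateGroup']]

res = (df.groupby(['ID1', 'ID2'], group_keys=False)
         .apply(expand)
         .reset_index(drop=True)
         [['ID1', 'ID2', 'DateGroup', 'Date', 'Values']])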
I have the following df of values for various slices across time:
date A B C
0 2016-01-01 5 7 2
1 2016-01-02 6 12 15
...
2 2016-01-08 9 5 16
...
3 2016-12-24 5 11 13
4 2016-12-31 3 52 22
I would like to create a new dataframe that calculates the week-over-week change in each slice, by date. For example, I want the new table to be blank (zero) for all slices from Jan 1 to Jan 7. I want the value for Jan 8 to be the Jan 8 value of the given slice minus the Jan 1 value of that slice. I then want the value for Jan 9 to be the Jan 9 value of the given slice minus the Jan 2 value. And so forth, all the way down.
The example table would look like this:
date A B C
0 2016-01-01 0 0 0
1 2016-01-02 0 0 0
...
2 2016-01-08 4 -2 14
...
3 2016-12-24 4 12 2
4 2016-12-31 -2 41 9
You may assume the offset is ALWAYS 7. In other words, there are no missing dates.
@Unatiel's answer is correct in this case, where there are no missing dates, and should be accepted.
But I wanted to post a modification here for cases with missing dates, for anyone interested. From the docs:
The shift method accepts a freq argument which can accept a DateOffset class or other timedelta-like object or also an offset alias
from pandas.tseries.offsets import Week
res = ((df - df.shift(1, freq=Week()).reindex(df.index))
.fillna(value=0)
.astype(int))
print(res)
A B
date
2016-01-01 0 0
2016-01-02 0 0
2016-01-03 0 0
2016-01-04 0 0
2016-01-05 0 0
2016-01-06 0 0
2016-01-07 0 0
2016-01-08 31 46
2016-01-09 4 20
2016-01-10 -51 -65
2016-01-11 56 5
2016-01-12 -51 24
.. ..
2016-01-20 34 -30
2016-01-21 -28 19
2016-01-22 24 8
2016-01-23 -28 -46
2016-01-24 -11 -60
2016-01-25 -34 -7
2016-01-26 -12 -28
2016-01-27 -41 42
2016-01-28 -2 48
2016-01-29 35 -51
2016-01-30 -8 62
2016-01-31 -6 -9
If we know the offset is always 7, then use shift(); here is a quick example showing how it works:
import pandas as pd

df = pd.DataFrame({'x': range(30)})
df.shift(7)
x
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 0.0
8 1.0
9 2.0
10 3.0
11 4.0
12 5.0
...
So with this you can do :
df - df.shift(7)
x
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 7.0
8 7.0
...
In your case, don't forget to set_index('date') first.
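Applied to the question's frame this becomes a one-liner, since diff(7) is shorthand for df - df.shift(7). A sketch, assuming daily rows with no gaps as stated:
res = df.set_index('date').diff(7).fillna(0).astype(int)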