Pandas groupby with conditional grouping - Python

I have two data frames and need to group the first one based on some criteria from the second df.
df1=
summary participant_id response_date
0 2.0 11 2016-04-30
1 3.0 11 2016-05-01
2 3.0 11 2016-05-02
3 3.0 11 2016-05-03
4 3.0 11 2016-05-04
5 3.0 11 2016-05-05
6 3.0 11 2016-05-06
7 4.0 11 2016-05-07
8 4.0 11 2016-05-08
9 3.0 11 2016-05-09
10 3.0 11 2016-05-10
11 3.0 11 2016-05-11
12 3.0 11 2016-05-12
13 3.0 11 2016-05-13
14 3.0 11 2016-05-14
15 3.0 11 2016-05-15
16 3.0 11 2016-05-16
17 4.0 11 2016-05-17
18 3.0 11 2016-05-18
19 3.0 11 2016-05-19
20 3.0 11 2016-05-20
21 4.0 11 2016-05-21
22 4.0 11 2016-05-22
23 4.0 11 2016-05-23
24 3.0 11 2016-05-24
25 3.0 11 2016-05-25
26 3.0 11 2016-05-26
27 3.0 11 2016-05-27
28 3.0 11 2016-05-28
29 3.0 11 2016-05-29
.. ... ... ...
df2 =
summary participant_id response_date
0 12.0 11 2016-04-30
1 12.0 11 2016-05-14
2 14.0 11 2016-05-28
. ... ... ...
I need to group df1 into blocks of rows falling between the dates in the response_date column of df2. Namely:
df1=
summary participant_id response_date
2.0 11 2016-04-30
3.0 11 2016-05-01
3.0 11 2016-05-02
3.0 11 2016-05-03
3.0 11 2016-05-04
3.0 11 2016-05-05
3.0 11 2016-05-06
4.0 11 2016-05-07
4.0 11 2016-05-08
3.0 11 2016-05-09
3.0 11 2016-05-10
3.0 11 2016-05-11
3.0 11 2016-05-12
3.0 11 2016-05-13
3.0 11 2016-05-14
3.0 11 2016-05-15
3.0 11 2016-05-16
4.0 11 2016-05-17
3.0 11 2016-05-18
3.0 11 2016-05-19
3.0 11 2016-05-20
4.0 11 2016-05-21
4.0 11 2016-05-22
4.0 11 2016-05-23
3.0 11 2016-05-24
3.0 11 2016-05-25
3.0 11 2016-05-26
3.0 11 2016-05-27
3.0 11 2016-05-28
3.0 11 2016-05-29
.. ... ... ...
Is there an elegant solution with groupby?

There might be a more elegant solution, but you can loop through the response_date values in df2, create a boolean series by checking against all the response_date values in df1, and simply sum them all up.
df1['group'] = 0
for rd in df2.response_date.values:
    df1['group'] += df1.response_date > rd
Output:
summary participant_id response_date group
0 2.0 11 2016-04-30 0
1 3.0 11 2016-05-01 1
2 3.0 11 2016-05-02 1
3 3.0 11 2016-05-03 1
4 3.0 11 2016-05-04 1
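If you want the blocks themselves, a minimal sketch (assuming the group column built above) is to iterate over the groupby object:
# Each value of 'group' labels one block of consecutive dates
for g, block in df1.groupby('group'):
    print(g)
    print(block)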
Building off of @Scott's answer:
You can use pd.cut, but you will need to add a date before the earliest date and after the latest date in response_date from df2:
dates = ([pd.Timestamp('2000-1-1')]
         + df2.response_date.sort_values().tolist()
         + [pd.Timestamp('2020-1-1')])
df1['group'] = pd.cut(df1['response_date'], dates)
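As a hedged usage sketch, once group holds the interval labels from pd.cut you can aggregate each block, e.g.:
# Mean summary per date block
print(df1.groupby('group')['summary'].mean())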

You want the .cut method. This lets you bin your dates by some other list of dates. Note that the bin edges must be sorted, and dates outside the outermost edges come out as NaN (which is why the answer above pads the edge list with sentinel dates).
df1['cuts'] = pd.cut(df1['response_date'], df2['response_date'])
grouped = df1.groupby('cuts')
print(grouped.max())  # for example

Python: Pandas merge three dataframes on date, keeping all dates

I have three dataframes
Dataframe df1:
date A
0 2022-04-11 1
1 2022-04-12 2
2 2022-04-14 26
3 2022-04-16 2
4 2022-04-17 1
5 2022-04-20 17
6 2022-04-21 14
7 2022-04-22 1
8 2022-04-23 9
9 2022-04-24 1
10 2022-04-25 5
11 2022-04-26 2
12 2022-04-27 21
13 2022-04-28 9
14 2022-04-29 17
15 2022-04-30 5
16 2022-05-01 8
17 2022-05-07 1241217
18 2022-05-08 211
19 2022-05-09 1002521
20 2022-05-10 488739
21 2022-05-11 12925
22 2022-05-12 57
23 2022-05-13 8515098
24 2022-05-14 1134576
Dataframe df2:
date B
0 2022-04-12 8
1 2022-04-14 7
2 2022-04-16 2
3 2022-04-19 2
4 2022-04-23 2
5 2022-05-07 2
6 2022-05-08 5
7 2022-05-09 2
8 2022-05-14 1
Dataframe df3:
date C
0 2022-04-12 6
1 2022-04-13 1
2 2022-04-14 2
3 2022-04-20 3
4 2022-04-21 9
5 2022-04-22 25
6 2022-04-23 56
7 2022-04-24 49
8 2022-04-25 68
9 2022-04-26 71
10 2022-04-27 40
11 2022-04-28 44
12 2022-04-29 27
13 2022-04-30 34
14 2022-05-01 28
15 2022-05-07 9
16 2022-05-08 20
17 2022-05-09 24
18 2022-05-10 21
19 2022-05-11 8
20 2022-05-12 8
21 2022-05-13 14
22 2022-05-14 25
23 2022-05-15 43
24 2022-05-16 36
25 2022-05-17 29
26 2022-05-18 28
27 2022-05-19 17
28 2022-05-20 6
I would like to merge df1, df2, and df3 into a single dataframe with columns date, A, B, C, such that date contains all dates which appear in df1, df2, and/or df3 (without repetition), and if a particular date is missing from one of the dataframes, then the respective column gets the value 0.0. So, I would like to have something like this:
date A B C
0 2022-04-11 1.0 0.0 0.0
1 2022-04-12 2.0 8.0 6.0
2 2022-04-13 0.0 0.0 1.0
...
I tried to use this method:
merge1 = pd.merge(df1, df2, how='outer')
sorted_merge1 = merge1.sort_values(by=['date'], ascending=False)
full_merge = pd.merge(sorted_merge1, df3, how='outer')
However, it seems to skip the dates that are not common to all three dataframes.
Try this:
print(pd.merge(df1, df2, on='date', how='outer')
        .merge(df3, on='date', how='outer')
        .fillna(0))
Output:
date A B C
0 2022-04-11 1.0 0.0 0.0
1 2022-04-12 2.0 8.0 6.0
2 2022-04-14 26.0 7.0 2.0
3 2022-04-16 2.0 2.0 0.0
4 2022-04-17 1.0 0.0 0.0
5 2022-04-20 17.0 0.0 3.0
6 2022-04-21 14.0 0.0 9.0
7 2022-04-22 1.0 0.0 25.0
8 2022-04-23 9.0 2.0 56.0
9 2022-04-24 1.0 0.0 49.0
10 2022-04-25 5.0 0.0 68.0
11 2022-04-26 2.0 0.0 71.0
12 2022-04-27 21.0 0.0 40.0
13 2022-04-28 9.0 0.0 44.0
14 2022-04-29 17.0 0.0 27.0
15 2022-04-30 5.0 0.0 34.0
16 2022-05-01 8.0 0.0 28.0
17 2022-05-07 1241217.0 2.0 9.0
18 2022-05-08 211.0 5.0 20.0
19 2022-05-09 1002521.0 2.0 24.0
20 2022-05-10 488739.0 0.0 21.0
21 2022-05-11 12925.0 0.0 8.0
22 2022-05-12 57.0 0.0 8.0
23 2022-05-13 8515098.0 0.0 14.0
24 2022-05-14 1134576.0 1.0 25.0
25 2022-04-19 0.0 2.0 0.0
26 2022-04-13 0.0 0.0 1.0
27 2022-05-15 0.0 0.0 43.0
28 2022-05-16 0.0 0.0 36.0
29 2022-05-17 0.0 0.0 29.0
30 2022-05-18 0.0 0.0 28.0
31 2022-05-19 0.0 0.0 17.0
32 2022-05-20 0.0 0.0 6.0
Perform a merge chain and fill the NaNs with 0.
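If the number of dataframes grows, a minimal sketch of the same merge chain with functools.reduce (dfs is an illustrative list of your frames):
from functools import reduce
import pandas as pd

dfs = [df1, df2, df3]  # any number of frames sharing a 'date' column
merged = reduce(lambda left, right: pd.merge(left, right, on='date', how='outer'), dfs)
merged = merged.fillna(0).sort_values('date')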

Calculated column with shift

This is the base DataFrame:
g_accessor number_opened number_closed
0 49 - 20 3.0 1.0
1 50 - 20 2.0 14.0
2 51 - 20 1.0 6.0
3 52 - 20 0.0 6.0
4 1 - 21 1.0 4.0
5 2 - 21 3.0 5.0
6 3 - 21 4.0 11.0
7 4 - 21 2.0 7.0
8 5 - 21 6.0 10.0
9 6 - 21 2.0 8.0
10 7 - 21 4.0 9.0
11 8 - 21 2.0 3.0
12 9 - 21 2.0 1.0
13 10 - 21 1.0 11.0
14 11 - 21 6.0 3.0
15 12 - 21 3.0 3.0
16 13 - 21 2.0 6.0
17 14 - 21 5.0 9.0
18 15 - 21 9.0 13.0
19 16 - 21 7.0 7.0
20 17 - 21 9.0 4.0
21 18 - 21 3.0 8.0
22 19 - 21 6.0 3.0
23 20 - 21 6.0 1.0
24 21 - 21 3.0 5.0
25 22 - 21 5.0 3.0
26 23 - 21 1.0 0.0
I want to add a new calculated column, number_active, which relies on previous values. For this I'm trying to use pd.DataFrame.shift(), like this:
# Creating new column and setting all rows to 0
df['number_active'] = 0
# Active from previous period
PREVIOUS_PERIOD_ACTIVE = 22
# Calculating active value for first period in the DataFrame, based on `PREVIOUS_PERIOD_ACTIVE`
df.iat[0,3] = (df.iat[0,1] + PREVIOUS_PERIOD_ACTIVE) - df.iat[0,2]
# Calculating all rows using DataFrame.shift()
df['number_active'] = (df['number_opened'] + df['number_active'].shift(1)) - df['number_closed']
# Recalculating first active value as it was overwritten in the previous step.
df.iat[0,3] = (df.iat[0,1] + PREVIOUS_PERIOD_ACTIVE) - df.iat[0,2]
The result:
g_accessor number_opened number_closed number_active
0 49 - 20 3.0 1.0 24.0
1 50 - 20 2.0 14.0 12.0
2 51 - 20 1.0 6.0 -5.0
3 52 - 20 0.0 6.0 -6.0
4 1 - 21 1.0 4.0 -3.0
5 2 - 21 3.0 5.0 -2.0
6 3 - 21 4.0 11.0 -7.0
7 4 - 21 2.0 7.0 -5.0
8 5 - 21 6.0 10.0 -4.0
9 6 - 21 2.0 8.0 -6.0
10 7 - 21 4.0 9.0 -5.0
11 8 - 21 2.0 3.0 -1.0
12 9 - 21 2.0 1.0 1.0
13 10 - 21 1.0 11.0 -10.0
14 11 - 21 6.0 3.0 3.0
15 12 - 21 3.0 3.0 0.0
16 13 - 21 2.0 6.0 -4.0
17 14 - 21 5.0 9.0 -4.0
18 15 - 21 9.0 13.0 -4.0
19 16 - 21 7.0 7.0 0.0
20 17 - 21 9.0 4.0 5.0
21 18 - 21 3.0 8.0 -5.0
22 19 - 21 6.0 3.0 3.0
23 20 - 21 6.0 1.0 5.0
24 21 - 21 3.0 5.0 -2.0
25 22 - 21 5.0 3.0 2.0
26 23 - 21 1.0 0.0 1.0
Oddly, it seems that only the first active value (index 1) is calculated correctly (the value at index 0 is calculated independently, via df.iat). For the rest of the values it seems that number_closed is interpreted as a negative value, for some reason.
What am I missing/doing wrong?
You are assuming that the result for the previous row is available when the current row is calculated. This is not how pandas calculations work. Pandas calculations treat each row in isolation, unless you are applying multi-row operations like cumsum and shift.
I would calculate the number active with a minimal example as:
import pandas

df = pandas.DataFrame({'ignore': ['a', 'b', 'c', 'd', 'e'],
                       'number_opened': [3, 4, 5, 4, 3],
                       'number_closed': [1, 2, 2, 1, 2]})
df['number_active'] = df['number_opened'].cumsum() + 22 - df['number_closed'].cumsum()
This gives a result of:
  ignore  number_opened  number_closed  number_active
0      a              3              1             24
1      b              4              2             26
2      c              5              2             29
3      d              4              1             32
4      e              3              2             33
The code in your question with my minimal example gave:
  ignore  number_opened  number_closed  number_active
0      a              3              1             24
1      b              4              2             26
2      c              5              2              3
3      d              4              1              3
4      e              3              2              1
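Applied to the DataFrame in the question, the same cumsum idea becomes, as a sketch (22 being the carry-over from the previous period):
PREVIOUS_PERIOD_ACTIVE = 22
# Running balance: cumulative opened minus cumulative closed, plus the carry-over
df['number_active'] = (PREVIOUS_PERIOD_ACTIVE
                       + df['number_opened'].cumsum()
                       - df['number_closed'].cumsum())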

Convert column vector into multi-column matrix

I have a column vector with, say, 30 values (1-30). I would like to manipulate this vector so that it becomes a matrix with 5 values in the first column, 10 values in the second and 15 values in the third column. How would I implement this using Pandas or NumPy?
import numpy as np
import pandas as pd

# Create data
df = pd.DataFrame(np.linspace(1, 30, 30))
print(df)
1
2
:
28
29
30
In order to get something like this:
# Manipulate the column vector to make columns where the first column has 5
# the second column has 10 and the last column has 15 values
'T1' 'T2' 'T3'
1 6 16
2 7 17
3 8 18
4 9 19
5 10 20
NA 11 21
NA 12 22
NA 13 23
NA 14 24
NA 15 25
NA NA 26
NA NA 27
NA NA 28
NA NA 29
NA NA 30
It took a little time to figure out what series this is, and I found that it's a triangular series, just a modified one.
tri = lambda x:int((0.25+2*x)**0.5-0.5)
This would give results like:
0 1 1 2 2 2 3 3 3 3 4 4 4 4 4 5 5 5 5 5 5 ...
And after the modification:
modtri = lambda x:int((0.25+2*(x//5))**0.5-0.5)
0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ...
So each occurrence in the normal triangular series repeats 5 times.
The modtri function above directly maps the index (starting from 0) to the appropriate group ids.
After that, this does the job:
df[0].groupby(modtri).apply(lambda x: pd.Series(x.values)).unstack().T
Full execution:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.linspace(1,30,30))
N = 5 #the increment value
modtri = lambda x:int((0.25+2*(x//N))**0.5-0.5)
df2 = df[0].groupby(modtri).apply(lambda x: pd.Series(x.values)).unstack().T
df2.rename(columns={0: "T1", 1: "T2", 2: "T3"}, inplace=True)
print(df2)
Output:
T1 T2 T3
0 1.0 6.0 16.0
1 2.0 7.0 17.0
2 3.0 8.0 18.0
3 4.0 9.0 19.0
4 5.0 10.0 20.0
5 NaN 11.0 21.0
6 NaN 12.0 22.0
7 NaN 13.0 23.0
8 NaN 14.0 24.0
9 NaN 15.0 25.0
10 NaN NaN 26.0
11 NaN NaN 27.0
12 NaN NaN 28.0
13 NaN NaN 29.0
14 NaN NaN 30.0
Try this by slicing with reindexing:
df['T1'] = df[0][0:5]
df['T2'] = df[0][5:15].reset_index(drop=True)
df['T3'] = df[0][15:].reset_index(drop=True)
Original data before operation:
df = pd.DataFrame(np.linspace(1,30,30))
print(df)
0
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
10 11.0
11 12.0
12 13.0
13 14.0
14 15.0
15 16.0
16 17.0
17 18.0
18 19.0
19 20.0
20 21.0
21 22.0
22 23.0
23 24.0
24 25.0
25 26.0
26 27.0
27 28.0
28 29.0
29 30.0
Running the new code:
df['T1'] = df[0][0:5]
df['T2'] = df[0][5:15].reset_index(drop=True)
df['T3'] = df[0][15:].reset_index(drop=True)
print(df)
0 T1 T2 T3
0 1.0 1.0 6.0 16.0
1 2.0 2.0 7.0 17.0
2 3.0 3.0 8.0 18.0
3 4.0 4.0 9.0 19.0
4 5.0 5.0 10.0 20.0
5 6.0 NaN 11.0 21.0
6 7.0 NaN 12.0 22.0
7 8.0 NaN 13.0 23.0
8 9.0 NaN 14.0 24.0
9 10.0 NaN 15.0 25.0
10 11.0 NaN NaN 26.0
11 12.0 NaN NaN 27.0
12 13.0 NaN NaN 28.0
13 14.0 NaN NaN 29.0
14 15.0 NaN NaN 30.0
15 16.0 NaN NaN NaN
16 17.0 NaN NaN NaN
17 18.0 NaN NaN NaN
18 19.0 NaN NaN NaN
19 20.0 NaN NaN NaN
20 21.0 NaN NaN NaN
21 22.0 NaN NaN NaN
22 23.0 NaN NaN NaN
23 24.0 NaN NaN NaN
24 25.0 NaN NaN NaN
25 26.0 NaN NaN NaN
26 27.0 NaN NaN NaN
27 28.0 NaN NaN NaN
28 29.0 NaN NaN NaN
29 30.0 NaN NaN NaN
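A hedged alternative that avoids hard-coding the slice boundaries: split the vector with np.split on the cumulative column sizes and let pandas pad the shorter columns with NaN (sizes and the T column names mirror the question's layout):
import numpy as np
import pandas as pd

values = np.linspace(1, 30, 30)
sizes = [5, 10, 15]  # desired column lengths
chunks = np.split(values, np.cumsum(sizes)[:-1])
# Series of different lengths align on the index, padding with NaN
out = pd.DataFrame({f'T{i + 1}': pd.Series(c) for i, c in enumerate(chunks)})
print(out)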

Subtraction between two dataframes' columns

I have two different datasets: total product data and selling data. I need to find the remaining products by comparing the product data with the selling data. I have done some general preprocessing so both dataframes are ready to use, but I can't work out how to compare them.
DataFrame 1:
Item Qty
0 BUDS2 1.0
1 C100 4.0
2 CK1 5.0
3 DM10 10.0
4 DM7 2.0
5 DM9 9.0
6 HM12 6.0
7 HM13 4.0
8 HOCOX25(CTYPE) 1.0
9 HOCOX30USB 1.0
10 RM510 8.0
11 RM512 8.0
12 RM569 1.0
13 RM711 2.0
14 T2C 1.0
and
DataFrame 2 :
Item Name Quantity
0 BUDS2 2.0
1 C100 5.0
2 C101CABLE 1.0
3 CK1 8.0
4 DM10 12.0
5 DM7 5.0
6 DM9 10.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 9.0
10 HM13 8.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 3.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 11.0
17 RM512 10.0
18 RM569 2.0
19 RM711 3.0
20 T2C 1.0
21 Y1 3.0
22 ZIRCON 1.0
I want to see the available quantity for each item, and I want an output like dataframe 2 but with the Quantity column values replaced by the result of the subtraction. How can I do that?
Expected Output:
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 2.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0
You can do this by merging the two dataframes:
df_new = df_2.merge(df_1, how='left', left_on='Item Name', right_on='Item').fillna(0)
df_new['Quantity'] = df_new['Quantity'] - df_new['Qty']
df_new = df_new.drop(['Item', 'Qty'], axis=1)
df_new output:
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0
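A hedged alternative sketch that uses index alignment instead of a merge; note that Series.sub with fill_value=0 keeps the union of items, so items that appear only in df_1 would show up too (unlike the left merge above):
# Align the two quantity Series on item name and subtract
s_stock = df_2.set_index('Item Name')['Quantity']
s_sold = df_1.set_index('Item')['Qty']
remaining = (s_stock.sub(s_sold, fill_value=0)
             .rename_axis('Item Name')
             .reset_index(name='Quantity'))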

To get a block of rows around a specific date in Pandas

I need to extract a contiguous block of rows around a particular date, marked by the presence of a non-NaN value in the q1 column. By block I mean k days before the date and p days after it.
For example, using the following dataframe, and setting k=5, p=2, I need to get the following blocks:
participant_id response_date q1 summary
0 11.0 2016-04-27 NaN NaN
1 11.0 2016-04-30 NaN 2.0
2 11.0 2016-05-01 1089.0 3.0
3 11.0 2016-05-02 NaN 3.0
4 11.0 2016-05-03 NaN 3.0
5 11.0 2016-05-04 NaN 3.0
6 11.0 2016-05-05 NaN 3.0
7 11.0 2016-05-06 NaN 3.0
8 11.0 2016-05-07 NaN 4.0
9 11.0 2016-05-08 NaN 4.0
10 11.0 2016-05-09 NaN 3.0
11 11.0 2016-05-10 NaN 3.0
12 11.0 2016-05-11 NaN 3.0
13 11.0 2016-05-12 NaN 3.0
14 11.0 2016-05-13 NaN 3.0
15 11.0 2016-05-14 NaN 3.0
16 11.0 2016-05-15 NaN 3.0
17 11.0 2016-05-16 NaN 3.0
18 11.0 2016-05-17 NaN 4.0
19 11.0 2016-05-18 NaN 3.0
20 11.0 2016-05-19 NaN 3.0
21 11.0 2016-05-20 NaN 3.0
22 11.0 2016-05-21 NaN 4.0
23 11.0 2016-05-22 NaN 4.0
24 11.0 2016-05-23 NaN 4.0
25 11.0 2016-05-24 NaN 3.0
26 11.0 2016-05-25 NaN 3.0
27 11.0 2016-05-26 NaN 3.0
28 11.0 2016-05-27 NaN 3.0
29 11.0 2016-05-28 NaN 3.0
30 11.0 2016-05-29 NaN 3.0
31 11.0 2016-05-30 NaN 3.0
32 11.0 2016-05-31 NaN 4.0
33 11.0 2016-06-01 NaN 4.0
34 11.0 2016-06-02 802.0 3.0
35 11.0 2016-06-03 NaN 3.0
36 11.0 2016-06-04 NaN 3.0
37 11.0 2016-06-05 NaN 3.0
38 11.0 2016-06-06 NaN 3.0
39 11.0 2016-06-07 NaN 3.0
40 11.0 2016-06-08 NaN 3.0
41 11.0 2016-06-09 NaN 3.0
42 11.0 2016-06-10 NaN 3.0
43 11.0 2016-06-11 NaN 5.0
44 11.0 2016-06-12 NaN 3.0
45 11.0 2016-06-13 NaN 4.0
46 11.0 2016-06-14 NaN 4.0
47 11.0 2016-06-15 NaN 3.0
48 11.0 2016-06-16 NaN 3.0
49 11.0 2016-06-17 NaN 3.0
Block 1 (up to 5 days before the date where q1 is not NaN, and 2 days after):
0 11.0 2016-04-27 NaN NaN
1 11.0 2016-04-30 NaN 2.0
2 11.0 2016-05-01 1089.0 3.0
3 11.0 2016-05-02 NaN 3.0
4 11.0 2016-05-03 NaN 3.0
Block 2:
30 11.0 2016-05-29 NaN 3.0
31 11.0 2016-05-30 NaN 3.0
32 11.0 2016-05-31 NaN 4.0
33 11.0 2016-06-01 NaN 4.0
34 11.0 2016-06-02 802.0 3.0
35 11.0 2016-06-03 NaN 3.0
36 11.0 2016-06-04 NaN 3.0
I have implemented this algorithm in a quite straightforward way, with loops and conditionals; however, that's pretty slow for a large data set, and I would like to learn a more pythonic/pandas-idiomatic solution. I anticipate it may involve the groupby function.
Since I do not have the starting code or data, I'll try my best. Given that your response_date column is a datetime column:
import datetime as dt

dates_not_null = your_df.loc[~your_df.q1.isnull(), 'response_date']
for i in dates_not_null:
    # keep rows within k days before and p days after each marked date
    mask = your_df.response_date.between(i - dt.timedelta(days=k),
                                         i + dt.timedelta(days=p))
    req_df = your_df.loc[mask, :]
You can append each of these dataframes to a list and then concatenate them, or do whatever you want with them.
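A minimal sketch of that append-and-concatenate step, continuing the loop above:
import pandas as pd

blocks = []
for i in dates_not_null:
    mask = your_df.response_date.between(i - dt.timedelta(days=k),
                                         i + dt.timedelta(days=p))
    blocks.append(your_df.loc[mask, :])
result = pd.concat(blocks, keys=list(dates_not_null))  # label each block by its centre date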
Using a helper function to get a dict of DataFrames and concatenate them:
from dateutil.relativedelta import relativedelta

def get_block(obj, d, k, p):
    # obj -> dataframe; d -> date
    start = d - relativedelta(days=k)
    end = d + relativedelta(days=p)
    obj = obj.set_index('response_date')
    return obj.loc[start:end]
dates = df[df.q1.notnull()]['response_date'].tolist()
result = {}
k = 5
p = 2
for d in dates:
    result[d] = get_block(df, d, k, p)
print(result[dates[0]])
participant_id q1 summary
response_date
2016-04-27 11 NaN NaN
2016-04-30 11 NaN 2.0
2016-05-01 11 1089.0 3.0
2016-05-02 11 NaN 3.0
2016-05-03 11 NaN 3.0
Then you can just concatenate this result:
result = pd.concat(result)
result.index = result.index.rename(['mid_date', 'response_date'])
print(result)
participant_id q1 summary
mid_date response_date
2016-05-01 2016-04-27 11 NaN NaN
2016-04-30 11 NaN 2.0
2016-05-01 11 1089.0 3.0
2016-05-02 11 NaN 3.0
2016-05-03 11 NaN 3.0
2016-06-02 2016-05-28 11 NaN 3.0
2016-05-29 11 NaN 3.0
2016-05-30 11 NaN 3.0
2016-05-31 11 NaN 4.0
2016-06-01 11 NaN 4.0
2016-06-02 11 802.0 3.0
2016-06-03 11 NaN 3.0
2016-06-04 11 NaN 3.0
I think a loop is pretty unavoidable here given that you may have overlapping sub-sections of your input.
