I am having issues finding a solution for the cumulative sum for MTD and YTD.
I need help to get this result.
Use groupby.cumsum, grouping by periods created with to_period:
# ensure datetime
s = pd.to_datetime(df['date'], dayfirst=False)
# group by year
df['ytd'] = df.groupby(s.dt.to_period('Y'))['count'].cumsum()
# group by month
df['mtd'] = df.groupby(s.dt.to_period('M'))['count'].cumsum()
Example (with dummy data):
date count ytd mtd
0 2022-08-26 6 6 6
1 2022-08-27 1 7 7
2 2022-08-28 4 11 11
3 2022-08-29 4 15 15
4 2022-08-30 8 23 23
5 2022-08-31 4 27 27
6 2022-09-01 6 33 6
7 2022-09-02 3 36 9
8 2022-09-03 5 41 14
9 2022-09-04 8 49 22
10 2022-09-05 7 56 29
11 2022-09-06 9 65 38
12 2022-09-07 9 74 47
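As a self-contained sketch of the approach (the dummy data below is randomly generated, so the totals will differ from the table above):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'date': pd.date_range('2022-08-26', periods=13, freq='D'),
    'count': rng.integers(1, 10, size=13),
})

s = pd.to_datetime(df['date'])
# to_period('Y') maps every date of one calendar year to the same key,
# so the cumulative sum restarts at each year boundary; 'M' restarts monthly
df['ytd'] = df.groupby(s.dt.to_period('Y'))['count'].cumsum()
df['mtd'] = df.groupby(s.dt.to_period('M'))['count'].cumsum()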
I am trying to flag rows as True or False when, within each group of ['trn_crd_no', 'loc_code'], the time difference between consecutive operations is less than 5 minutes.
Everything works fine when there is more than one ['trn_crd_no', 'loc_code'] group, but it fails when there is only one.
BBDD_Patron1:
trn_id trn_date loc_code trn_crd_no prd_acc_no
0 1 28/05/2019 10:29 20004 1111 32
1 2 28/05/2019 10:30 20004 1111 434
2 3 28/05/2019 10:35 20004 1111 24
3 4 28/05/2019 10:37 20004 1111 6453
4 5 28/05/2019 10:39 20004 1111 5454
5 6 28/05/2019 10:40 20004 1111 2132
6 7 28/05/2019 10:41 20004 1111 45
7 8 28/05/2019 13:42 20007 2222 867
8 9 28/05/2019 13:47 20007 2222 765
9 19 28/05/2019 13:54 20007 2222 2334
10 11 28/05/2019 13:56 20007 2222 3454
11 12 28/05/2019 14:03 20007 2222 23
12 13 28/05/2019 15:40 20007 2222 534
13 14 28/05/2019 15:45 20007 2222 13
14 15 28/05/2019 17:05 20007 2222 765
15 16 28/05/2019 17:08 20007 2222 87
16 17 28/05/2019 14:07 10003 2222 4526
# ensure trn_date is datetime
BBDD_Patron1['trn_date'] = pd.to_datetime(BBDD_Patron1['trn_date'], dayfirst=True)
# note: pd.Timedelta(5) is 5 nanoseconds; pd.Timedelta(minutes=5) would match the stated 5-minute condition
aux = BBDD_Patron1.groupby(['trn_crd_no', 'loc_code'], as_index=False).apply(lambda x: x.trn_date.diff().fillna(0).abs() < pd.Timedelta(5))
aux:
0 0 True
1 False
2 False
3 False
4 False
5 False
6 False
1 16 True
2 7 True
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
Create a new DataFrame as a copy of the first one, and include the new column with the Boolean values:
BBDD_Patron1_v = BBDD_Patron1.copy()
BBDD_Patron1_v['consec'] = aux.reset_index(level=0, drop=True)
Results as expected.
BBDD_Patron1_v:
trn_id trn_date loc_code trn_crd_no prd_acc_no consec
0 1 2019-05-28 10:29:00 20004 1111 32 True
1 2 2019-05-28 10:30:00 20004 1111 434 False
2 3 2019-05-28 10:35:00 20004 1111 24 False
3 4 2019-05-28 10:37:00 20004 1111 6453 False
4 5 2019-05-28 10:39:00 20004 1111 5454 False
5 6 2019-05-28 10:40:00 20004 1111 2132 False
6 7 2019-05-28 10:41:00 20004 1111 45 False
7 8 2019-05-28 13:42:00 20007 2222 867 True
8 9 2019-05-28 13:47:00 20007 2222 765 False
9 19 2019-05-28 13:54:00 20007 2222 2334 False
10 11 2019-05-28 13:56:00 20007 2222 3454 False
11 12 2019-05-28 14:03:00 20007 2222 23 False
12 13 2019-05-28 15:40:00 20007 2222 534 False
13 14 2019-05-28 15:45:00 20007 2222 13 False
14 15 2019-05-28 17:05:00 20007 2222 765 False
15 16 2019-05-28 17:08:00 20007 2222 87 False
16 17 2019-05-28 14:07:00 10003 2222 4526 True
PROBLEM: If I have only one group after the groupby:
BBDD_2:
trn_id trn_date loc_code trn_crd_no prd_acc_no
0 1 2019-05-28 10:29:00 20004 1111 32
1 2 2019-05-28 10:30:00 20004 1111 434
2 3 2019-05-28 10:35:00 20004 1111 24
3 4 2019-05-28 10:37:00 20004 1111 6453
4 5 2019-05-28 10:39:00 20004 1111 5454
5 6 2019-05-28 10:40:00 20004 1111 2132
6 7 2019-05-28 10:41:00 20004 1111 45
aux2:
trn_date 0 1 2 3 4 5 6
trn_crd_no loc_code
1111 20004 True False False False False False False
Since the structure of aux2 is different, I get an error with the following line:
BBDD_Patron1_v['consec'] = aux.reset_index(level=0, drop=True)
ValueError: Wrong number of items passed 7, placement implies 1
I also tried setting squeeze=True, but that gives yet another structure, so I still cannot copy the Boolean values into BBDD_Patron1.
aux = BBDD_Patron1.groupby(['trn_crd_no', 'loc_code'], squeeze=True).apply(lambda x: x.trn_date.diff().fillna(0).abs() < pd.Timedelta(5))
Results when there is more than one group, aux =
trn_crd_no loc_code
1111 20004 0 True
1 False
2 False
3 False
4 False
5 False
6 False
2222 10003 16 True
20007 7 True
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
Results when there is only one group, aux2 =
0 True
1 False
2 False
3 False
4 False
5 False
6 False
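One way to sidestep the shape problem entirely (a sketch of an alternative, not from the original post) is to avoid apply and compute the diff with the groupby's own diff, which always returns a Series aligned with the original index, whatever the number of groups. The threshold below uses the 5-minute condition as stated in the question; recall that pd.Timedelta(5) in the code above actually means 5 nanoseconds:

# per-group time difference; the shape is stable even with a single group
diffs = BBDD_Patron1.groupby(['trn_crd_no', 'loc_code'])['trn_date'].diff()
# NaT marks the first row of each group; fill it so the comparison holds there
BBDD_Patron1['consec'] = diffs.fillna(pd.Timedelta(0)).abs() < pd.Timedelta(minutes=5)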
I am trying to find, within a dataframe, whether there are at least X consecutive operations (I already included a column "FILTER_OK" that indicates whether each row meets the criteria), and to extract that group of rows.
TRN TRN_DATE FILTER_OK
0 5153 04/04/2017 11:40:00 True
1 7542 04/04/2017 17:18:00 True
2 875 04/04/2017 20:08:00 True
3 74 05/04/2017 20:30:00 False
4 9652 06/04/2017 20:32:00 True
5 965 07/04/2017 12:52:00 True
6 752 10/04/2017 17:40:00 True
7 9541 10/04/2017 19:29:00 True
8 7452 11/04/2017 12:20:00 True
9 9651 12/04/2017 13:57:00 False
For this example, suppose I am looking for at least 4 consecutive operations.
OUTPUT DESIRED:
TRN TRN_DATE FILTER_OK
4 9652 06/04/2017 20:32:00 True
5 965 07/04/2017 12:52:00 True
6 752 10/04/2017 17:40:00 True
7 9541 10/04/2017 19:29:00 True
8 7452 11/04/2017 12:20:00 True
How can I subset the operations I need?
You may do this using cumsum, followed by groupby, and transform:
v = (~df.FILTER_OK).cumsum()
df[v.groupby(v).transform('size').ge(4) & df['FILTER_OK']]
TRN TRN_DATE FILTER_OK
4 9652 2017-06-04 20:32:00 True
5 965 2017-07-04 12:52:00 True
6 752 2017-10-04 17:40:00 True
7 9541 2017-10-04 19:29:00 True
8 7452 2017-11-04 12:20:00 True
Details
First, use cumsum to segregate rows into groups:
v = (~df.FILTER_OK).cumsum()
v
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 2
Name: FILTER_OK, dtype: int64
Next, find the size of each group, and then figure out what groups have at least X rows (in your case, 4):
v.groupby(v).transform('size')
0 3
1 3
2 3
3 6
4 6
5 6
6 6
7 6
8 6
9 1
Name: FILTER_OK, dtype: int64
v.groupby(v).transform('size').ge(4)
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 True
9 False
Name: FILTER_OK, dtype: bool
AND this mask with "FILTER_OK" to ensure we only take valid rows that fit the criteria.
v.groupby(v).transform('size').ge(4) & df['FILTER_OK']
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
8 True
9 False
Name: FILTER_OK, dtype: bool
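One caveat: every group after the first begins with the False row that started it, so transform('size').ge(4) only guarantees 3 consecutive True rows in those groups. A minimal variant that counts the True rows per group instead (not from the original answer) avoids the off-by-one:

# count True rows per group instead of total group size
df[df['FILTER_OK'].groupby(v).transform('sum').ge(4) & df['FILTER_OK']]   # same result on the sample data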
Note: since this groups runs of consecutive equal values, it would also pick up runs of 4 or more consecutive False rows:
# label runs of consecutive equal FILTER_OK values
s = df.FILTER_OK.astype(int).diff().ne(0).cumsum()
# keep rows whose run has at least 4 members
df[s.isin(s.value_counts().loc[lambda x: x >= 4].index)]
Out[784]:
    TRN             TRN_DATE  FILTER_OK
4  9652  06/04/2017 20:32:00       True
5   965  07/04/2017 12:52:00       True
6   752  10/04/2017 17:40:00       True
7  9541  10/04/2017 19:29:00       True
8  7452  11/04/2017 12:20:00       True
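For reference, the run labels in s for the sample frame come out as follows (a quick check, derived by hand):

s = df.FILTER_OK.astype(int).diff().ne(0).cumsum()
print(s.tolist())   # [1, 1, 1, 2, 3, 3, 3, 3, 3, 4]
# run 3 (rows 4-8) has 5 members, so it is the only run the filter keeps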
One possible option is to use itertools.groupby, called on df.values.
An important difference of this method compared to pd.groupby is that a new group is started whenever the grouping key changes.
So you can try the following code:
import pandas as pd
import itertools
# Source DataFrame
df = pd.DataFrame(data=[
[ 5153, '04/04/2017 11:40:00', True ], [ 7542, '04/04/2017 17:18:00', True ],
[ 875, '04/04/2017 20:08:00', True ], [ 74, '05/04/2017 20:30:00', False ],
[ 9652, '06/04/2017 20:32:00', True ], [ 965, '07/04/2017 12:52:00', True ],
[ 752, '10/04/2017 17:40:00', True ], [ 9541, '10/04/2017 19:29:00', True ],
[ 7452, '11/04/2017 12:20:00', True ], [ 9651, '12/04/2017 13:57:00', False ]],
columns=[ 'TRN', 'TRN_DATE', 'FILTER_OK' ])
# Work list
xx = []
# Collect groups for the True key with at least 4 members
for key, group in itertools.groupby(df.values, lambda x: x[2]):
    lst = list(group)
    if key and len(lst) >= 4:
        xx.extend(lst)
# Create result DataFrame with the same column names
df2 = pd.DataFrame(data=xx, columns=df.columns)
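Checking the result (my addition; this is what the sample data should produce):

print(df2)
#     TRN             TRN_DATE  FILTER_OK
# 0  9652  06/04/2017 20:32:00       True
# 1   965  07/04/2017 12:52:00       True
# 2   752  10/04/2017 17:40:00       True
# 3  9541  10/04/2017 19:29:00       True
# 4  7452  11/04/2017 12:20:00       True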
This is actually part of a "group by" operation (by the CRD column).
If there are two consecutive groups of rows (CRD 111 and CRD 333), and the second group does not meet the condition (it has no 4 consecutive True rows), the first row of that second group is still included (the bold line below), when it shouldn't be:
CRD TRN TRN_DATE FILTER_OK
0 111 5153 04/04/2017 11:40:00 True
1 111 7542 04/04/2017 17:18:00 True
2 256 875 04/04/2017 20:08:00 True
3 365 74 05/04/2017 20:30:00 False
4 111 9652 06/04/2017 20:32:00 True
5 111 965 07/04/2017 12:52:00 True
6 111 752 10/04/2017 17:40:00 True
7 111 9541 10/04/2017 19:29:00 True
**8 333 7452 11/04/2017 12:20:00 True**
9 333 9651 12/04/2017 13:57:00 False
10 333 961 12/04/2017 13:57:00 False
11 333 871 12/04/2017 13:57:00 False
Actual output:
CRD TRN TRN_DATE FILTER_OK
4 111 9652 06/04/2017 20:32:00 True
5 111 965 07/04/2017 12:52:00 True
6 111 752 10/04/2017 17:40:00 True
7 111 9541 10/04/2017 19:29:00 True
**8 333 7452 11/04/2017 12:20:00 True**
Desired output:
CRD TRN TRN_DATE FILTER_OK
4 111 9652 06/04/2017 20:32:00 True
5 111 965 07/04/2017 12:52:00 True
6 111 752 10/04/2017 17:40:00 True
7 111 9541 10/04/2017 19:29:00 True
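A sketch of one way to fix this (not from the original answers): start a new run whenever either FILTER_OK or the CRD value changes, so a run can never straddle two cards:

# a run breaks when FILTER_OK flips or CRD changes
change = df['FILTER_OK'].ne(df['FILTER_OK'].shift()) | df['CRD'].ne(df['CRD'].shift())
run_id = change.cumsum()
# keep only rows in True-runs of at least 4 rows
df[df['FILTER_OK'] & run_id.groupby(run_id).transform('size').ge(4)]
# -> rows 4-7 for the sample above; row 8 (CRD 333) no longer leaks in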
I have a large DataFrame with many groups.
What I want to do is iterate over each group, and depending on if a certain condition is met, I want to sum up values for that group.
My DataFrame looks something like this:
Item_Num Price_Change Unit_Sales
10 True 10
10 False 15
10 False 11
10 False 13
12 True 10
12 False 11
12 False 14
12 True 11
12 False 11
For each Item_Num group, whenever there is a price change I want to record the sum of unit sales from that row up to (but not including) the next price change. So I want results like this:
   Item_Num  Price_Change  Unit_Sales  Sum
1        10          True          10   49
2        10         False          15
3        10         False          11
4        10         False          13
5        12          True          10   35
6        12         False          11
7        12         False          14
8        12          True          11   22
9        12         False          11
(So I'm getting the sum of 49 by summing rows 1 through 4, the sum of 35 by summing rows 5 through 7, and the sum of 22 by summing rows 8 and 9.)
Here's what I have so far (sketch):
for name, group in new.groupby('UPC'):
    if group['Price_Change'] == True:
        sum(unit_sales until next price change)   # pseudocode
What's the best way to iterate through each group (can my method be improved) and how can I select the row where Price_Change == True?
Very close to your previous question :-)
# group by item and by the running count of price changes (one label per segment)
df['New'] = df.groupby([df['Item_Num'], df['Price_Change'].cumsum()])['Unit_Sales'].transform('sum')
df
Out[15]:
Item_Num Price_Change Unit_Sales New
0 10 True 10 49
1 10 False 15 49
2 10 False 11 49
3 10 False 13 49
4 12 True 10 35
5 12 False 11 35
6 12 False 14 35
7 12 True 11 22
8 12 False 11 22
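The grouping key is Price_Change.cumsum(), which gives every price-change segment its own label; for the sample frame it is (a quick illustration):

df['Price_Change'].cumsum().tolist()
# [1, 1, 1, 1, 2, 2, 2, 3, 3] -> combined with Item_Num: segments (10,1), (12,2), (12,3)

The second step then blanks out the sums on rows without a price change: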
df.New = df.New.where(df['Price_Change'], '')
df
Out[17]:
Item_Num Price_Change Unit_Sales New
0 10 True 10 49
1 10 False 15
2 10 False 11
3 10 False 13
4 12 True 10 35
5 12 False 11
6 12 False 14
7 12 True 11 22
8 12 False 11
I have the following dataframe (df):
SERV_OR_IOR_ID IMP_START_TIME IMP_CLR_TIME IMP_START_TIME_BIN IMP_CLR_TIME_BIN
0 -1447310116 23:59:00 00:11:00 47 0
1 1673545041 00:00:00 00:01:00 0 0
2 -743717696 23:59:00 00:00:00 47 0
3 58641876 04:01:00 09:02:00 8 18
I want to duplicate the rows for which IMP_START_TIME_BIN is less than IMP_CLR_TIME_BIN, as many times as the absolute difference between IMP_START_TIME_BIN and IMP_CLR_TIME_BIN, and append them at the end of the data frame (or, preferably, right below the original row), incrementing the value of IMP_START_TIME_BIN each time.
For example, for row 3 the difference is 10, so I should append 10 rows to the data frame, incrementing the value in IMP_START_TIME_BIN from 8 (exclusive) to 18 (inclusive).
The result should look like this:
SERV_OR_IOR_ID IMP_START_TIME IMP_CLR_TIME IMP_START_TIME_BIN IMP_CLR_TIME_BIN
0 -1447310116 23:59:00 00:11:00 47 0
1 1673545041 00:00:00 00:01:00 0 0
2 -743717696 23:59:00 00:00:00 47 0
3 58641876 04:01:00 09:02:00 8 18
4 58641876 04:01:00 09:02:00 9 18
... ... ... ... ... ...
13 58641876 04:01:00 09:02:00 18 18
For this I tried the following, but it didn't work:
for i in range(len(df)):
    if df.ix[i,3] < df.ix[i,4]:
        for j in range(df.ix[i,3]+1, df.ix[i,4]+1):
            df = df.append((df.set_value(i,'IMP_START_TIME_BIN',j))*abs(df.ix[i,3] - df.ix[i,4]))
How can I do it?
You can use this solution; the only requirement is that the index values are unique:
import numpy as np
import pandas as pd

#first filter only values for repeating
l = df['IMP_CLR_TIME_BIN'] - df['IMP_START_TIME_BIN']
l = l[l > 0]
print (l)
3 10
dtype: int64
#repeat rows by repeating index values
df1 = df.loc[np.repeat(l.index.values,l.values)].copy()
#add counter to column IMP_START_TIME_BIN
#better explanation http://stackoverflow.com/a/43518733/2901002
a = pd.Series(df1.index == df1.index.to_series().shift())
b = a.cumsum()
a = b.sub(b.mask(a).ffill().fillna(0).astype(int)).add(1)
df1['IMP_START_TIME_BIN'] = df1['IMP_START_TIME_BIN'] + a.values
#append to original df, if necessary sort
df = df.append(df1, ignore_index=True).sort_values('SERV_OR_IOR_ID')
print (df)
    SERV_OR_IOR_ID IMP_START_TIME IMP_CLR_TIME  IMP_START_TIME_BIN  IMP_CLR_TIME_BIN
0      -1447310116       23:59:00     00:11:00                  47                 0
1       1673545041       00:00:00     00:01:00                   0                 0
2       -743717696       23:59:00     00:00:00                  47                 0
3         58641876       04:01:00     09:02:00                   8                18
4         58641876       04:01:00     09:02:00                   9                18
5         58641876       04:01:00     09:02:00                  10                18
6         58641876       04:01:00     09:02:00                  11                18
7         58641876       04:01:00     09:02:00                  12                18
8         58641876       04:01:00     09:02:00                  13                18
9         58641876       04:01:00     09:02:00                  14                18
10        58641876       04:01:00     09:02:00                  15                18
11        58641876       04:01:00     09:02:00                  16                18
12        58641876       04:01:00     09:02:00                  17                18
13        58641876       04:01:00     09:02:00                  18                18
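For intuition about the counter block above: with this sample, df1 carries the index value 3 repeated ten times, so a simply counts 1 through 10 within that run, and that counter is what gets added to IMP_START_TIME_BIN (8 + 1..10 = 9..18):

print(a.tolist())   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] for the sample above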