Extract groups of consecutive values larger than a specified size - python

I am trying to find whether a dataframe contains at least X consecutive operations (I already included a column "FILTER_OK" that indicates whether a row meets the criteria) and extract that group of rows.
TRN TRN_DATE FILTER_OK
0 5153 04/04/2017 11:40:00 True
1 7542 04/04/2017 17:18:00 True
2 875 04/04/2017 20:08:00 True
3 74 05/04/2017 20:30:00 False
4 9652 06/04/2017 20:32:00 True
5 965 07/04/2017 12:52:00 True
6 752 10/04/2017 17:40:00 True
7 9541 10/04/2017 19:29:00 True
8 7452 11/04/2017 12:20:00 True
9 9651 12/04/2017 13:57:00 False
For this example, assume I am looking for at least 4 consecutive operations.
OUTPUT DESIRED:
TRN TRN_DATE FILTER_OK
4 9652 06/04/2017 20:32:00 True
5 965 07/04/2017 12:52:00 True
6 752 10/04/2017 17:40:00 True
7 9541 10/04/2017 19:29:00 True
8 7452 11/04/2017 12:20:00 True
How can I subset the operations I need?

You may do this using cumsum, followed by groupby, and transform:
v = (~df.FILTER_OK).cumsum()
df[v.groupby(v).transform('size').ge(4) & df['FILTER_OK']]
TRN TRN_DATE FILTER_OK
4 9652 2017-06-04 20:32:00 True
5 965 2017-07-04 12:52:00 True
6 752 2017-10-04 17:40:00 True
7 9541 2017-10-04 19:29:00 True
8 7452 2017-11-04 12:20:00 True
Details
First, use cumsum to segregate rows into groups:
v = (~df.FILTER_OK).cumsum()
v
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 2
Name: FILTER_OK, dtype: int64
Next, find the size of each group, and then figure out what groups have at least X rows (in your case, 4):
v.groupby(v).transform('size')
0 3
1 3
2 3
3 6
4 6
5 6
6 6
7 6
8 6
9 1
Name: FILTER_OK, dtype: int64
v.groupby(v).transform('size').ge(4)
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 True
9 False
Name: FILTER_OK, dtype: bool
AND this mask with "FILTER_OK" to ensure we only take valid rows that fit the criteria.
v.groupby(v).transform('size').ge(4) & df['FILTER_OK']
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
8 True
9 False
Name: FILTER_OK, dtype: bool
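If this pattern is needed repeatedly, a small helper might look like the sketch below (not part of the original answer; it counts only the True rows inside each cumsum group, so a leading False row does not inflate the run length):
def consecutive_ok(df, min_len=4):
    # rows belonging to runs of at least `min_len` consecutive FILTER_OK == True
    v = (~df['FILTER_OK']).cumsum()
    run_len = df['FILTER_OK'].groupby(v).transform('sum')  # count of True rows per group
    return df[df['FILTER_OK'] & run_len.ge(min_len)]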

Note that this will also consider runs of consecutive False values, and x > 4 keeps runs strictly longer than 4; use x >= 4 if runs of exactly 4 should also qualify.
s=df.FILTER_OK.astype(int).diff().ne(0).cumsum()
df[s.isin(s.value_counts().loc[lambda x : x>4].index)]
Out[784]:
TRN TRN_DATE FILTER_OK
4 9652 06/04/2017 20:32:00 True
5 965 07/04/2017 12:52:00 True
6 752 10/04/2017 17:40:00 True
7 9541 10/04/2017 19:29:00 True
8 7452 11/04/2017 12:20:00 True

One possible option is to use itertools.groupby called on the source df.values.
An important difference of this method compared to DataFrame.groupby is that a new group is created whenever the grouping key changes.
So you can try the following code:
import pandas as pd
import itertools
# Source DataFrame
df = pd.DataFrame(data=[
[ 5153, '04/04/2017 11:40:00', True ], [ 7542, '04/04/2017 17:18:00', True ],
[ 875, '04/04/2017 20:08:00', True ], [ 74, '05/04/2017 20:30:00', False ],
[ 9652, '06/04/2017 20:32:00', True ], [ 965, '07/04/2017 12:52:00', True ],
[ 752, '10/04/2017 17:40:00', True ], [ 9541, '10/04/2017 19:29:00', True ],
[ 7452, '11/04/2017 12:20:00', True ], [ 9651, '12/04/2017 13:57:00', False ]],
columns=[ 'TRN', 'TRN_DATE', 'FILTER_OK' ])
# Work list
xx = []
# Collect groups for the 'True' key with at least 4 members
for key, group in itertools.groupby(df.values, lambda x: x[2]):
    lst = list(group)
    if key and len(lst) >= 4:
        xx.extend(lst)
# Create result DataFrame with the same column names
df2 = pd.DataFrame(data=xx, columns=df.columns)

This is actually part of a "group by" operation (by the CRD column).
If there are two consecutive groups of rows (CRD 111 and 333) and the second group does not meet the condition (fewer than 4 consecutive True rows), the first row of that second group is still included (the bold line), when it shouldn't be; see the sketch after the desired output below.
CRD TRN TRN_DATE FILTER_OK
0 111 5153 04/04/2017 11:40:00 True
1 111 7542 04/04/2017 17:18:00 True
2 256 875 04/04/2017 20:08:00 True
3 365 74 05/04/2017 20:30:00 False
4 111 9652 06/04/2017 20:32:00 True
5 111 965 07/04/2017 12:52:00 True
6 111 752 10/04/2017 17:40:00 True
7 111 9541 10/04/2017 19:29:00 True
**8 333 7452 11/04/2017 12:20:00 True**
9 333 9651 12/04/2017 13:57:00 False
10 333 961 12/04/2017 13:57:00 False
11 333 871 12/04/2017 13:57:00 False
Actual output:
CRD TRN TRN_DATE FILTER_OK
4 111 9652 06/04/2017 20:32:00 True
5 111 965 07/04/2017 12:52:00 True
6 111 752 10/04/2017 17:40:00 True
7 111 9541 10/04/2017 19:29:00 True
**8 333 7452 11/04/2017 12:20:00 True**
Desired output:
CRD TRN TRN_DATE FILTER_OK
4 111 9652 06/04/2017 20:32:00 True
5 111 965 07/04/2017 12:52:00 True
6 111 752 10/04/2017 17:40:00 True
7 111 9541 10/04/2017 19:29:00 True
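One possible way to handle this (a sketch, not taken from the answers above): restart the run whenever FILTER_OK flips or CRD changes, then keep runs of at least 4 True rows.
g = (df['FILTER_OK'].ne(df['FILTER_OK'].shift()) | df['CRD'].ne(df['CRD'].shift())).cumsum()
mask = df['FILTER_OK'] & df.groupby(g)['FILTER_OK'].transform('size').ge(4)
df[mask]
For the CRD example above this keeps only rows 4-7, because the run of True rows for CRD 333 has length 1.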


How to get a value in a column as an index

I assign the eligible index value to column A and then forward-fill it with df.ffill().
Now I want to use the value of column A as an index and assign the obtained value to the expected column.
I tried df['expected'] = df['price'][df['A']] but it doesn't work.
input
import pandas as pd
import numpy as np
d = {'trade_date': ['2021-08-10', '2021-08-11', '2021-08-12', '2021-08-13',
                    '2021-08-14', '2021-08-15', '2021-08-16', '2021-08-17',
                    '2021-08-18', '2021-08-19', '2021-08-20'],
     'price': [2, 12, 8, 10, 11, 18, 7, 19, 9, 8, 12],
     'cond': [True, False, True, False, True, False, True, False, True, True, True]}
df = pd.DataFrame(d)
df.index=pd.to_datetime(df.trade_date)
df['A']=df.index.where(df['cond'])
df['A']=df['A'].ffill()
df.to_clipboard()
df
Expected result table:
trade_date price cond A expected
2021/8/10 2 TRUE 2021/8/10 2
2021/8/11 12 FALSE 2021/8/10 2
2021/8/12 8 TRUE 2021/8/12 8
2021/8/13 10 FALSE 2021/8/12 8
2021/8/14 11 TRUE 2021/8/14 11
2021/8/15 18 FALSE 2021/8/14 11
2021/8/16 7 TRUE 2021/8/16 7
2021/8/17 19 FALSE 2021/8/16 7
2021/8/18 9 TRUE 2021/8/18 9
2021/8/19 8 TRUE 2021/8/19 8
2021/8/20 12 TRUE 2021/8/20 12
Try this:
df['expected'] = df['A'].map(df['price'])
print(df)
price cond A expected
trade_date
2021-08-10 2 True 2021-08-10 2
2021-08-11 12 False 2021-08-10 2
2021-08-12 8 True 2021-08-12 8
2021-08-13 10 False 2021-08-12 8
2021-08-14 11 True 2021-08-14 11
2021-08-15 18 False 2021-08-14 11
2021-08-16 7 True 2021-08-16 7
2021-08-17 19 False 2021-08-16 7
2021-08-18 9 True 2021-08-18 9
2021-08-19 8 True 2021-08-19 8
2021-08-20 12 True 2021-08-20 12
Alternatively, you could use groupby and transform:
df.assign(expected=df.groupby(['A'])['price'].transform('first'))
-------------------------------------------------
price cond A expected
trade_date
2021-08-10 2 True 2021-08-10 2
2021-08-11 12 False 2021-08-10 2
2021-08-12 8 True 2021-08-12 8
2021-08-13 10 False 2021-08-12 8
2021-08-14 11 True 2021-08-14 11
2021-08-15 18 False 2021-08-14 11
2021-08-16 7 True 2021-08-16 7
2021-08-17 19 False 2021-08-16 7
2021-08-18 9 True 2021-08-18 9
2021-08-19 8 True 2021-08-19 8
2021-08-20 12 True 2021-08-20 12
-------------------------------------------------
This approach groups by A, takes the first value of price within each group, and broadcasts it back to every row of that group.
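As a side note (an assumption about why the original attempt fails, not stated above): df['price'][df['A']] looks up the right values, but the resulting Series is indexed by the dates stored in A, and assigning it back to df aligns on that duplicated index rather than on row positions. Dropping the index sidesteps the alignment:
df['expected'] = df['price'].loc[df['A']].to_numpy()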

How do I count values in multiple columns based on multiple criteria and create a new column row-wise?

index  key   M1   M2   M3   M4   M5   M6   M7   M8   M9  M10  Average  Count_G10  Count_L10
0      a     12    0  159    0   20   49    0  131  157  153     68.1          4          3
1      b      0   68  195  189    0   79   12  179   21   62     80.5          3          4
2      c      0  139    0  188   12    0   31   87  152   73     68.2          4          2
3      d    126  156    0  112  178  146    0   19  192   25     95.4          6          2
4      e    109    0  172    0    0    0   44  145  186  100     75.6          5          1
5      f     63  183  194  183    0  163  136   13  163  162      126          6          2
6      g    101  143    0  184    0  107  103    0   60  133     83.1          6          1
7      h     13  101  139   86  101   72   93  151    0    0     75.6          6          1
8      i    182   71   73   73  129   32   56  135    0  114     86.5          4          5
9      j     82    0  198    0  117   21    0   32   64  146       66          4          2
10     k    145    0  194    0  156   71    0   89   57   31     74.3          4          2
I would like to get the columns Count_G10 and Count_L10, where the logic for Count_G10 is as follows:
count of months (M1 to M10 columns) where value is > 0 and ((value - average) / average) > 0.1
Similarly, the Count_L10 logic is:
count of months (M1 to M10 columns) where value is > 0 and ((value - average) / average) < -0.1
I have tried the following in Pandas:
Oct20_Nov21 = [202010,202011,202012,202101,202102,202103,202104,202105,202106,202107,202108,202109,202110,202111]
new_df3['G10%'] = new_df3[Oct20_Nov21].applymap(lambda x : 1 if (((x-new_df3.avg_w_s)/new_df3.avg_w_s) > 0.1).any() else 0).values.sum(axis=1)
I get the following error:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Let me know what I am missing here. Thank you.
Don't use applymap: it is very slow, and here the lambda compares each scalar value against a whole Series, which leads to the ambiguous-truth-value error shown.
Instead, it is best to perform the operations at the DataFrame level.
In the most verbose form we could do:
m_df = df.filter(like='M')
df['Count_G10'] = (
    m_df.gt(0) &
    m_df.sub(df['Average'], axis=0).div(df['Average'], axis=0).gt(0.1)
).sum(axis=1)
df['Count_L10'] = (
    m_df.gt(0) &
    m_df.sub(df['Average'], axis=0).div(df['Average'], axis=0).lt(-0.1)
).sum(axis=1)
key   M1   M2   M3   M4   M5   M6   M7   M8   M9  M10  Average  Count_G10  Count_L10
a     12    0  159    0   20   49    0  131  157  153     68.1          4          3
b      0   68  195  189    0   79   12  179   21   62     80.5          3          4
c      0  139    0  188   12    0   31   87  152   73     68.2          4          2
d    126  156    0  112  178  146    0   19  192   25     95.4          6          2
e    109    0  172    0    0    0   44  145  186  100     75.6          5          1
f     63  183  194  183    0  163  136   13  163  162      126          6          2
g    101  143    0  184    0  107  103    0   60  133     83.1          6          1
h     13  101  139   86  101   72   93  151    0    0     75.6          6          1
i    182   71   73   73  129   32   56  135    0  114     86.5          4          5
j     82    0  198    0  117   21    0   32   64  146       66          4          2
k    145    0  194    0  156   71    0   89   57   31     74.3          4          2
First, filter to select only the M columns (there are many other ways to select the desired columns that would also work):
m_df = df.filter(like='M')
M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
0 12 0 159 0 20 49 0 131 157 153
1 0 68 195 189 0 79 12 179 21 62
2 0 139 0 188 12 0 31 87 152 73
3 126 156 0 112 178 146 0 19 192 25
4 109 0 172 0 0 0 44 145 186 100
5 63 183 194 183 0 163 136 13 163 162
6 101 143 0 184 0 107 103 0 60 133
7 13 101 139 86 101 72 93 151 0 0
8 182 71 73 73 129 32 56 135 0 114
9 82 0 198 0 117 21 0 32 64 146
10 145 0 194 0 156 71 0 89 57 31
Then use the comparison operations gt and lt to do the value comparisons.
Step 1: check for values greater than 0:
m_df.gt(0)
M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
0 True False True False True True False True True True
1 False True True True False True True True True True
2 False True False True True False True True True True
3 True True False True True True False True True True
4 True False True False False False True True True True
5 True True True True False True True True True True
6 True True False True False True True False True True
7 True True True True True True True True False False
8 True True True True True True True True False True
9 True False True False True True False True True True
10 True False True False True True False True True True
Evaluate: ((value - average) / average). Both operations need to align on axis=0.
Step 2: Subtract
m_df.sub(df['Average'], axis=0)
M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
0 -56.1 -68.1 90.9 -68.1 -48.1 -19.1 -68.1 62.9 88.9 84.9
1 -80.5 -12.5 114.5 108.5 -80.5 -1.5 -68.5 98.5 -59.5 -18.5
2 -68.2 70.8 -68.2 119.8 -56.2 -68.2 -37.2 18.8 83.8 4.8
3 30.6 60.6 -95.4 16.6 82.6 50.6 -95.4 -76.4 96.6 -70.4
4 33.4 -75.6 96.4 -75.6 -75.6 -75.6 -31.6 69.4 110.4 24.4
5 -63.0 57.0 68.0 57.0 -126.0 37.0 10.0 -113.0 37.0 36.0
6 17.9 59.9 -83.1 100.9 -83.1 23.9 19.9 -83.1 -23.1 49.9
7 -62.6 25.4 63.4 10.4 25.4 -3.6 17.4 75.4 -75.6 -75.6
8 95.5 -15.5 -13.5 -13.5 42.5 -54.5 -30.5 48.5 -86.5 27.5
9 16.0 -66.0 132.0 -66.0 51.0 -45.0 -66.0 -34.0 -2.0 80.0
10 70.7 -74.3 119.7 -74.3 81.7 -3.3 -74.3 14.7 -17.3 -43.3
Step 3: Divide
m_df.sub(df['Average'], axis=0).div(df['Average'], axis=0)
M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
0 -0.823789 -1.000000 1.334802 -1.000000 -0.706314 -0.280470 -1.000000 0.923642 1.305433 1.246696
1 -1.000000 -0.155280 1.422360 1.347826 -1.000000 -0.018634 -0.850932 1.223602 -0.739130 -0.229814
2 -1.000000 1.038123 -1.000000 1.756598 -0.824047 -1.000000 -0.545455 0.275660 1.228739 0.070381
3 0.320755 0.635220 -1.000000 0.174004 0.865828 0.530398 -1.000000 -0.800839 1.012579 -0.737945
4 0.441799 -1.000000 1.275132 -1.000000 -1.000000 -1.000000 -0.417989 0.917989 1.460317 0.322751
5 -0.500000 0.452381 0.539683 0.452381 -1.000000 0.293651 0.079365 -0.896825 0.293651 0.285714
6 0.215403 0.720818 -1.000000 1.214200 -1.000000 0.287605 0.239471 -1.000000 -0.277978 0.600481
7 -0.828042 0.335979 0.838624 0.137566 0.335979 -0.047619 0.230159 0.997354 -1.000000 -1.000000
8 1.104046 -0.179191 -0.156069 -0.156069 0.491329 -0.630058 -0.352601 0.560694 -1.000000 0.317919
9 0.242424 -1.000000 2.000000 -1.000000 0.772727 -0.681818 -1.000000 -0.515152 -0.030303 1.212121
10 0.951548 -1.000000 1.611036 -1.000000 1.099596 -0.044415 -1.000000 0.197847 -0.232840 -0.582773
Step 4: Compare with gt (or lt for Count_L10):
m_df.sub(df['Average'], axis=0).div(df['Average'], axis=0).gt(0.1)
M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
0 False False True False False False False True True True
1 False False True True False False False True False False
2 False True False True False False False True True False
3 True True False True True True False False True False
4 True False True False False False False True True True
5 False True True True False True False False True True
6 True True False True False True True False False True
7 False True True True True False True True False False
8 True False False False True False False True False True
9 True False True False True False False False False True
10 True False True False True False False True False False
Step 5: Find where both conditions are True with logical AND (&)
(
m_df.gt(0) &
m_df.sub(df['Average'], axis=0).div(df['Average'], axis=0).gt(0.1)
)
M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
0 False False True False False False False True True True
1 False False True True False False False True False False
2 False True False True False False False True True False
3 True True False True True True False False True False
4 True False True False False False False True True True
5 False True True True False True False False True True
6 True True False True False True True False False True
7 False True True True True False True True False False
8 True False False False True False False True False True
9 True False True False True False False False False True
10 True False True False True False False True False False
Step 6: Count the number of True values in each row with sum. True is 1 and False is 0 (the additive identity), which is why sum works for counting True values:
(
m_df.gt(0) &
m_df.sub(df['Average'], axis=0).div(df['Average'], axis=0).gt(0.1)
).sum(axis=1)
0 4
1 3
2 4
3 6
4 5
5 6
6 6
7 6
8 4
9 4
10 4
dtype: int64
An almost identical process applies for Count_L10; the only difference is checking .lt(-0.1) instead of .gt(0.1).
The operation can be greatly simplified by extracting and reusing the common pieces and refactoring the expression:
m_df = df.filter(like='M')
# Shared Condition
m = m_df.gt(0)
# Values
v = m_df.div(df['Average'], axis=0) - 1
df['Count_G10'] = (m & v.gt(0.1)).sum(axis=1)
df['Count_L10'] = (m & v.lt(-0.1)).sum(axis=1)
Both conditions use the check for values greater than 0, so we can keep it in a variable (m) and use it twice. Both expressions also compare against the same quantity, ((value - average) / average), so we can store it in a variable v.
That expression can further be simplified to ((value / average) - 1), since (v - a) / a = v/a - a/a = v/a - 1.
This reduces the overall computation time, at the expense of some readability, but produces the same results:
key   M1   M2   M3   M4   M5   M6   M7   M8   M9  M10  Average  Count_G10  Count_L10
a     12    0  159    0   20   49    0  131  157  153     68.1          4          3
b      0   68  195  189    0   79   12  179   21   62     80.5          3          4
c      0  139    0  188   12    0   31   87  152   73     68.2          4          2
d    126  156    0  112  178  146    0   19  192   25     95.4          6          2
e    109    0  172    0    0    0   44  145  186  100     75.6          5          1
f     63  183  194  183    0  163  136   13  163  162      126          6          2
g    101  143    0  184    0  107  103    0   60  133     83.1          6          1
h     13  101  139   86  101   72   93  151    0    0     75.6          6          1
i    182   71   73   73  129   32   56  135    0  114     86.5          4          5
j     82    0  198    0  117   21    0   32   64  146       66          4          2
k    145    0  194    0  156   71    0   89   57   31     74.3          4          2
Setup used:
import pandas as pd
df = pd.DataFrame({
'key': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k'],
'M1': [12, 0, 0, 126, 109, 63, 101, 13, 182, 82, 145],
'M2': [0, 68, 139, 156, 0, 183, 143, 101, 71, 0, 0],
'M3': [159, 195, 0, 0, 172, 194, 0, 139, 73, 198, 194],
'M4': [0, 189, 188, 112, 0, 183, 184, 86, 73, 0, 0],
'M5': [20, 0, 12, 178, 0, 0, 0, 101, 129, 117, 156],
'M6': [49, 79, 0, 146, 0, 163, 107, 72, 32, 21, 71],
'M7': [0, 12, 31, 0, 44, 136, 103, 93, 56, 0, 0],
'M8': [131, 179, 87, 19, 145, 13, 0, 151, 135, 32, 89],
'M9': [157, 21, 152, 192, 186, 163, 60, 0, 0, 64, 57],
'M10': [153, 62, 73, 25, 100, 162, 133, 0, 114, 146, 31],
'Average': [68.1, 80.5, 68.2, 95.4, 75.6, 126.0, 83.1, 75.6, 86.5, 66.0,
74.3]
})
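As a quick sanity check (a sketch using the setup above; the np.allclose comparison is mine, not part of the answer), the simplified expression matches the verbose one:
import numpy as np
m_df = df.filter(like='M')
verbose = m_df.sub(df['Average'], axis=0).div(df['Average'], axis=0)
simplified = m_df.div(df['Average'], axis=0) - 1
assert np.allclose(verbose, simplified)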

Pandas - Compare the current year's value to those of all previous years by group and return True if it is the cumulative minimum

I have a Pandas dataframe with the following structure
id date num
243 2014-12-01 3
234 2014-12-01 2
243 2015-12-01 2
234 2016-12-01 4
243 2016-12-01 6
234 2017-12-01 5
243 2018-12-01 7
234 2018-12-01 10
243 2019-12-01 1
234 2019-12-01 12
243 2020-12-01 15
234 2020-12-01 5
I want to add another column that flags, per id, whether num is smaller than every value in previous years. For example, id 243 on date 2019-12-01 has value 1. In this case the new field flag will be True because no value in previous years was smaller for id 243. The expected dataframe should look like the one below:
id date num flag
243 2014-12-01 3 -
234 2014-12-01 2 -
243 2015-12-01 2 True
234 2016-12-01 4 False
243 2016-12-01 6 False
234 2017-12-01 5 False
243 2018-12-01 7 False
234 2018-12-01 10 False
243 2019-12-01 1 True
234 2019-12-01 12 False
243 2020-12-01 15 False
234 2020-12-01 5 False
I am stuck on finding a solution that allows me to compare each row to those of previous years. Any suggestions on how to compare each row's value to those in earlier years?
Thanks
Use .cummin to get the cumulative minimum by group.
Use .cumcount with np.where to set the first value of each group to -.
import numpy as np
df['flag'] = (df['num'] == df.groupby(['id'])['num'].transform('cummin'))
df['flag'] = np.where(df.groupby('id').cumcount() == 0, '-', df['flag'])
df
Out[1]:
id date num flag
0 243 2014-12-01 3 -
1 234 2014-12-01 2 -
2 243 2015-12-01 2 True
3 234 2016-12-01 4 False
4 243 2016-12-01 6 False
5 234 2017-12-01 5 False
6 243 2018-12-01 7 False
7 234 2018-12-01 10 False
8 243 2019-12-01 1 True
9 234 2019-12-01 12 False
10 243 2020-12-01 15 False
11 234 2020-12-01 5 False
Minor note: Instead of np.where(), you can also use:
df['flag'] = df['flag'].where(df.groupby('id').cumcount() != 0, '-')
which essentially does the exact same thing.
In one line of code:
(df.num == df.groupby('id').num.cummin()).where(df.groupby('id').cumcount() != 0, '-')
Let us use cummin + duplicated with where
(df['num']==df.groupby('id')['num'].cummin()).where(df.id.duplicated(),'-')
0 -
1 -
2 True
3 False
4 False
5 False
6 False
7 False
8 True
9 False
10 False
11 False
Name: num, dtype: object
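If ties with an earlier minimum should not count as a new low (an assumption; the approaches above flag num == cummin as True), a strict variant could compare against the cumulative minimum of the preceding rows only. For the sample data both give the same result:
prev_min = df.groupby('id')['num'].transform(lambda s: s.cummin().shift())
df['flag'] = np.where(prev_min.isna(), '-', df['num'] < prev_min)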

How to calculate cumulative groupby counts in Pandas with point in time?

I have a df that contains multiple weekly snapshots of JIRA tickets. I want to calculate the YTD counts of tickets.
the df looks like this:
pointInTime ticketId
2008-01-01 111
2008-01-01 222
2008-01-01 333
2008-01-07 444
2008-01-07 555
2008-01-07 666
2008-01-14 777
2008-01-14 888
2008-01-14 999
So with df.groupby(['pointInTime'])['ticketId'].count() I can get the count of IDs in every snapshot. But what I want to achieve is to calculate the cumulative sum
and have a df that looks like this:
pointInTime ticketId cumCount
2008-01-01 111 3
2008-01-01 222 3
2008-01-01 333 3
2008-01-07 444 6
2008-01-07 555 6
2008-01-07 666 6
2008-01-14 777 9
2008-01-14 888 9
2008-01-14 999 9
So for 2008-01-07 the number of tickets would be the count for 2008-01-07 plus the count for 2008-01-01.
Use GroupBy.count and cumsum, then map the result back to "pointInTime":
df['cumCount'] = (
df['pointInTime'].map(df.groupby('pointInTime')['ticketId'].count().cumsum()))
df
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9
I am using value_counts
df.pointInTime.map(df.pointInTime.value_counts().sort_index().cumsum())
Out[207]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
Name: pointInTime, dtype: int64
Or
pd.Series(np.arange(len(df))+1,index=df.index).groupby(df['pointInTime']).transform('last')
Out[216]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
dtype: int32
Here's an approach transforming with the size and multiplying by the result of taking pd.factorize on pointInTime:
df['cumCount'] = (df.groupby('pointInTime').ticketId
                    .transform('size')
                    .mul(pd.factorize(df.pointInTime)[0] + 1))
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9
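For reference, a setup that reproduces the sample frame used above (reconstructed from the question; some of the answers also assume numpy as np):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'pointInTime': ['2008-01-01'] * 3 + ['2008-01-07'] * 3 + ['2008-01-14'] * 3,
    'ticketId': [111, 222, 333, 444, 555, 666, 777, 888, 999],
})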

MultiIndex after groupby and apply when there is only one group

I am trying to set True or False if some rows (grouped by 'trn_crd_no' and 'loc_code') meet a condition (the difference between operations is less than 5 minutes).
Everything goes fine if there is more than one group, but it fails when there is only one ['trn_crd_no', 'loc_code'] group.
BBDD_Patron1:
trn_id trn_date loc_code trn_crd_no prd_acc_no
0 1 28/05/2019 10:29 20004 1111 32
1 2 28/05/2019 10:30 20004 1111 434
2 3 28/05/2019 10:35 20004 1111 24
3 4 28/05/2019 10:37 20004 1111 6453
4 5 28/05/2019 10:39 20004 1111 5454
5 6 28/05/2019 10:40 20004 1111 2132
6 7 28/05/2019 10:41 20004 1111 45
7 8 28/05/2019 13:42 20007 2222 867
8 9 28/05/2019 13:47 20007 2222 765
9 19 28/05/2019 13:54 20007 2222 2334
10 11 28/05/2019 13:56 20007 2222 3454
11 12 28/05/2019 14:03 20007 2222 23
12 13 28/05/2019 15:40 20007 2222 534
13 14 28/05/2019 15:45 20007 2222 13
14 15 28/05/2019 17:05 20007 2222 765
15 16 28/05/2019 17:08 20007 2222 87
16 17 28/05/2019 14:07 10003 2222 4526
# Make sure trn_date is a datetime
BBDD_Patron1['trn_date'] = pd.to_datetime(BBDD_Patron1['trn_date'])
aux = BBDD_Patron1.groupby(['trn_crd_no', 'loc_code'], as_index=False).apply(lambda x: x.trn_date.diff().fillna(0).abs() < pd.Timedelta(5))
aux:
0 0 True
1 False
2 False
3 False
4 False
5 False
6 False
1 16 True
2 7 True
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
Create a new DataFrame copy from the first one and include the new column with the Boolean values:
BBDD_Patron1_v = BBDD_Patron1.copy()
BBDD_Patron1_v['consec'] = aux.reset_index(level=0, drop=True)
Results as expected.
BBDD_Patron1_v:
trn_id trn_date loc_code trn_crd_no prd_acc_no consec
0 1 2019-05-28 10:29:00 20004 1111 32 True
1 2 2019-05-28 10:30:00 20004 1111 434 False
2 3 2019-05-28 10:35:00 20004 1111 24 False
3 4 2019-05-28 10:37:00 20004 1111 6453 False
4 5 2019-05-28 10:39:00 20004 1111 5454 False
5 6 2019-05-28 10:40:00 20004 1111 2132 False
6 7 2019-05-28 10:41:00 20004 1111 45 False
7 8 2019-05-28 13:42:00 20007 2222 867 True
8 9 2019-05-28 13:47:00 20007 2222 765 False
9 19 2019-05-28 13:54:00 20007 2222 2334 False
10 11 2019-05-28 13:56:00 20007 2222 3454 False
11 12 2019-05-28 14:03:00 20007 2222 23 False
12 13 2019-05-28 15:40:00 20007 2222 534 False
13 14 2019-05-28 15:45:00 20007 2222 13 False
14 15 2019-05-28 17:05:00 20007 2222 765 False
15 16 2019-05-28 17:08:00 20007 2222 87 False
16 17 2019-05-28 14:07:00 10003 2222 4526 True
PROBLEM: If I have only one group after the groupby:
BBDD_2:
trn_id trn_date loc_code trn_crd_no prd_acc_no
0 1 2019-05-28 10:29:00 20004 1111 32
1 2 2019-05-28 10:30:00 20004 1111 434
2 3 2019-05-28 10:35:00 20004 1111 24
3 4 2019-05-28 10:37:00 20004 1111 6453
4 5 2019-05-28 10:39:00 20004 1111 5454
5 6 2019-05-28 10:40:00 20004 1111 2132
6 7 2019-05-28 10:41:00 20004 1111 45
aux2:
trn_date 0 1 2 3 4 5 6
trn_crd_no loc_code
1111 20004 True False False False False False False
Since the structure of aux is different, I get an error with the following line:
BBDD_Patron1_v['consec'] = aux.reset_index(level=0, drop=True)
ValueError: Wrong number of items passed 7, placement implies 1
I am also trying to set squeeze=True, but it too gives a different structure, so I cannot copy the Boolean values into BBDD_Patron1.
aux = BBDD_Patron1.groupby(['trn_crd_no', 'loc_code'], squeeze=True).apply(lambda x: x.trn_date.diff().fillna(0).abs() < pd.Timedelta(5))
Results when there is more than one group. aux =
trn_crd_no loc_code
1111 20004 0 True
1 False
2 False
3 False
4 False
5 False
6 False
2222 10003 16 True
20007 7 True
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
Results when there is only one group. aux2 =
0 True
1 False
2 False
3 False
4 False
5 False
6 False
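One possible workaround (a sketch, not an answer from this thread; note also that pd.Timedelta(5) means 5 nanoseconds, so pd.Timedelta(minutes=5) is presumably the intended threshold): GroupBy.diff always returns a Series aligned with the original index, regardless of how many groups there are, so the reshaping problem with apply never arises.
diffs = BBDD_Patron1.groupby(['trn_crd_no', 'loc_code'])['trn_date'].diff()
BBDD_Patron1['consec'] = diffs.isna() | (diffs.abs() < pd.Timedelta(minutes=5))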
