Multiindex after Groupby and apply if only one group - python

I am trying to set True or False when rows (grouped by 'trn_crd_no' and 'loc_code') meet a condition (the difference between operations is less than 5 minutes).
Everything works fine when there is more than one ['trn_crd_no', 'loc_code'] group, but it fails when there is only one.
BBDD_Patron1:
trn_id trn_date loc_code trn_crd_no prd_acc_no
0 1 28/05/2019 10:29 20004 1111 32
1 2 28/05/2019 10:30 20004 1111 434
2 3 28/05/2019 10:35 20004 1111 24
3 4 28/05/2019 10:37 20004 1111 6453
4 5 28/05/2019 10:39 20004 1111 5454
5 6 28/05/2019 10:40 20004 1111 2132
6 7 28/05/2019 10:41 20004 1111 45
7 8 28/05/2019 13:42 20007 2222 867
8 9 28/05/2019 13:47 20007 2222 765
9 19 28/05/2019 13:54 20007 2222 2334
10 11 28/05/2019 13:56 20007 2222 3454
11 12 28/05/2019 14:03 20007 2222 23
12 13 28/05/2019 15:40 20007 2222 534
13 14 28/05/2019 15:45 20007 2222 13
14 15 28/05/2019 17:05 20007 2222 765
15 16 28/05/2019 17:08 20007 2222 87
16 17 28/05/2019 14:07 10003 2222 4526
# Set trn_date as datetime
BBDD_Patron1['trn_date'] = pd.to_datetime(BBDD_Patron1['trn_date'])
aux = BBDD_Patron1.groupby(['trn_crd_no', 'loc_code'], as_index=False).apply(lambda x: x.trn_date.diff().fillna(0).abs() < pd.Timedelta(5))
aux:
0 0 True
1 False
2 False
3 False
4 False
5 False
6 False
1 16 True
2 7 True
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
Create a new DF as a copy of the first one, and include the new column with the Boolean values:
BBDD_Patron1_v = BBDD_Patron1.copy()
BBDD_Patron1_v['consec'] = aux.reset_index(level=0, drop=True)
Results as expected.
BBDD_Patron1_v:
trn_id trn_date loc_code trn_crd_no prd_acc_no consec
0 1 2019-05-28 10:29:00 20004 1111 32 True
1 2 2019-05-28 10:30:00 20004 1111 434 False
2 3 2019-05-28 10:35:00 20004 1111 24 False
3 4 2019-05-28 10:37:00 20004 1111 6453 False
4 5 2019-05-28 10:39:00 20004 1111 5454 False
5 6 2019-05-28 10:40:00 20004 1111 2132 False
6 7 2019-05-28 10:41:00 20004 1111 45 False
7 8 2019-05-28 13:42:00 20007 2222 867 True
8 9 2019-05-28 13:47:00 20007 2222 765 False
9 19 2019-05-28 13:54:00 20007 2222 2334 False
10 11 2019-05-28 13:56:00 20007 2222 3454 False
11 12 2019-05-28 14:03:00 20007 2222 23 False
12 13 2019-05-28 15:40:00 20007 2222 534 False
13 14 2019-05-28 15:45:00 20007 2222 13 False
14 15 2019-05-28 17:05:00 20007 2222 765 False
15 16 2019-05-28 17:08:00 20007 2222 87 False
16 17 2019-05-28 14:07:00 10003 2222 4526 True
PROBLEM: If I have only one group after the groupby:
BBDD_2:
trn_id trn_date loc_code trn_crd_no prd_acc_no
0 1 2019-05-28 10:29:00 20004 1111 32
1 2 2019-05-28 10:30:00 20004 1111 434
2 3 2019-05-28 10:35:00 20004 1111 24
3 4 2019-05-28 10:37:00 20004 1111 6453
4 5 2019-05-28 10:39:00 20004 1111 5454
5 6 2019-05-28 10:40:00 20004 1111 2132
6 7 2019-05-28 10:41:00 20004 1111 45
aux2:
trn_date 0 1 2 3 4 5 6
trn_crd_no loc_code
1111 20004 True False False False False False False
Since the structure of the result (aux2) is different, I get an error with the same assignment line:
BBDD_Patron1_v['consec'] = aux.reset_index(level=0, drop=True)
ValueError: Wrong number of items passed 7, placement implies 1
I also tried setting squeeze=True, but it gives a different structure as well, so I cannot copy the Boolean values into BBDD_Patron1.
aux = BBDD_Patron1.groupby(['trn_crd_no', 'loc_code'], squeeze=True).apply(lambda x: x.trn_date.diff().fillna(0).abs() < pd.Timedelta(5))
Results when there is more than one group, aux =
trn_crd_no loc_code
1111 20004 0 True
1 False
2 False
3 False
4 False
5 False
6 False
2222 10003 16 True
20007 7 True
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
Results when there is only one group, aux2 =
0 True
1 False
2 False
3 False
4 False
5 False
6 False
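One way to side-step the shape change (a sketch, not part of the original question): call diff() directly on the grouped Series instead of going through apply. GroupBy.diff() always returns a Series aligned with the original row index, whether there is one group or many, so the assignment works the same for BBDD_Patron1 and BBDD_2. Note also that pd.Timedelta(5) means 5 nanoseconds; pd.Timedelta(minutes=5) matches the stated 5-minute rule.
# Sketch: the grouped diff() keeps the original row index, so the shape
# is the same for one group or many.
gaps = BBDD_Patron1.groupby(['trn_crd_no', 'loc_code'])['trn_date'].diff().abs()
BBDD_Patron1_v = BBDD_Patron1.copy()
# NaT marks the first row of each group; treating it as 0 keeps that row True.
BBDD_Patron1_v['consec'] = gaps.fillna(pd.Timedelta(0)) < pd.Timedelta(minutes=5)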

Related

How to get a value in a column as an index

I assign the eligible index value to column A and then df.ffill().
Now I want to use the value of column A as an index and assign the looked-up value to the expected column.
I tried df['expected'] = df['price'][df['A']] but it doesn't work.
input
import pandas as pd
import numpy as np
d={'trade_date':['2021-08-10','2021-08-11','2021-08-12','2021-08-13','2021-08-14','2021-08-15','2021-08-16','2021-08-17','2021-08-18','2021-08-19','2021-08-20',],'price':[2,12,8,10,11,18,7,19,9,8,12],'cond':[True,False,True,False,True,False,True,False,True,True,True]}
df = pd.DataFrame(d)
df.index=pd.to_datetime(df.trade_date)
df['A']=df.index.where(df['cond'])
df['A']=df['A'].ffill()
df.to_clipboard()
df
expected result table
trade_date price cond A expected
2021/8/10 2 TRUE 2021/8/10 2
2021/8/11 12 FALSE 2021/8/10 2
2021/8/12 8 TRUE 2021/8/12 8
2021/8/13 10 FALSE 2021/8/12 8
2021/8/14 11 TRUE 2021/8/14 11
2021/8/15 18 FALSE 2021/8/14 11
2021/8/16 7 TRUE 2021/8/16 7
2021/8/17 19 FALSE 2021/8/16 7
2021/8/18 9 TRUE 2021/8/18 9
2021/8/19 8 TRUE 2021/8/19 8
2021/8/20 12 TRUE 2021/8/20 12
Try this:
df['expected'] = df['A'].map(df['price'])
print(df)
price cond A expected
trade_date
2021-08-10 2 True 2021-08-10 2
2021-08-11 12 False 2021-08-10 2
2021-08-12 8 True 2021-08-12 8
2021-08-13 10 False 2021-08-12 8
2021-08-14 11 True 2021-08-14 11
2021-08-15 18 False 2021-08-14 11
2021-08-16 7 True 2021-08-16 7
2021-08-17 19 False 2021-08-16 7
2021-08-18 9 True 2021-08-18 9
2021-08-19 8 True 2021-08-19 8
2021-08-20 12 True 2021-08-20 12
Alternatively, you could use groupby and transform:
df.assign(expected=df.groupby(['A'])['price'].transform('first'))
-------------------------------------------------
price cond A expected
trade_date
2021-08-10 2 True 2021-08-10 2
2021-08-11 12 False 2021-08-10 2
2021-08-12 8 True 2021-08-12 8
2021-08-13 10 False 2021-08-12 8
2021-08-14 11 True 2021-08-14 11
2021-08-15 18 False 2021-08-14 11
2021-08-16 7 True 2021-08-16 7
2021-08-17 19 False 2021-08-16 7
2021-08-18 9 True 2021-08-18 9
2021-08-19 8 True 2021-08-19 8
2021-08-20 12 True 2021-08-20 12
-------------------------------------------------
This approach groups by A, takes the first value of price for each group and assigns it to the corresponding group.
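For reference, the original attempt df['expected'] = df['price'][df['A']] fails because the lookup result carries the looked-up dates as its (duplicated) index, which pandas cannot align back onto df. A small sketch of a fix along those lines, dropping that index so the assignment is positional:
# Look the prices up by the dates stored in A, then discard the resulting
# index so the values are assigned by position.
df['expected'] = df['price'].reindex(df['A']).to_numpy()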

How to calculate cumulative groupby counts in Pandas with point in time?

I have a df that contains multiple weekly snapshots of JIRA tickets. I want to calculate the YTD counts of tickets.
the df looks like this:
pointInTime ticketId
2008-01-01 111
2008-01-01 222
2008-01-01 333
2008-01-07 444
2008-01-07 555
2008-01-07 666
2008-01-14 777
2008-01-14 888
2008-01-14 999
So with df.groupby(['pointInTime'])['ticketId'].count() I can get the count of IDs in every snapshot. But what I want to achieve is the cumulative sum,
so that the df looks like this:
pointInTime ticketId cumCount
2008-01-01 111 3
2008-01-01 222 3
2008-01-01 333 3
2008-01-07 444 6
2008-01-07 555 6
2008-01-07 666 6
2008-01-14 777 9
2008-01-14 888 9
2008-01-14 999 9
So for 2008-01-07 the number of tickets would be the count for 2008-01-07 plus the count for 2008-01-01.
Use GroupBy.count and cumsum, then map the result back to "pointInTime":
df['cumCount'] = (
    df['pointInTime'].map(df.groupby('pointInTime')['ticketId'].count().cumsum()))
df
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9
I am using value_counts
df.pointInTime.map(df.pointInTime.value_counts().sort_index().cumsum())
Out[207]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
Name: pointInTime, dtype: int64
Or
pd.Series(np.arange(len(df))+1,index=df.index).groupby(df['pointInTime']).transform('last')
Out[216]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
dtype: int32
Here's an approach that transforms with the group size and multiplies by the pd.factorize codes of pointInTime plus one (note that this equals the cumulative count only because every snapshot here has the same number of tickets):
df['cumCount'] = (df.groupby('pointInTime').ticketId
                    .transform('size')
                    .mul(pd.factorize(df.pointInTime)[0] + 1))
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9

Mark duplicates based on time difference between successive rows

There are duplicated transactions in a bank dataframe (DF). ID holds customer IDs. A duplicated transaction is a multi-swipe, where a vendor accidentally charges a customer's card multiple times within a short time span (2 minutes here).
DF = pd.DataFrame({
    'ID': ['111', '111', '111', '111', '222', '222', '222', '333', '333', '333', '333', '111'],
    'Dollar': [1, 3, 1, 10, 25, 8, 25, 9, 20, 9, 9, 10],
    'transactionDateTime': ['2016-01-08 19:04:50', '2016-01-29 19:03:55', '2016-01-08 19:05:50',
                            '2016-01-08 20:08:50', '2016-01-08 19:04:50', '2016-02-08 19:04:50',
                            '2016-03-08 19:04:50', '2016-01-08 19:04:50', '2016-03-08 19:05:53',
                            '2016-01-08 19:03:20', '2016-01-08 19:02:15', '2016-02-08 20:08:50']})
DF['transactionDateTime'] = pd.to_datetime(DF['transactionDateTime'])
ID Dollar transactionDateTime
0 111 1 2016-01-08 19:04:50
1 111 3 2016-01-29 19:03:55
2 111 1 2016-01-08 19:05:50
3 111 10 2016-01-08 20:08:50
4 222 25 2016-01-08 19:04:50
5 222 8 2016-02-08 19:04:50
6 222 25 2016-03-08 19:04:50
7 333 9 2016-01-08 19:04:50
8 333 20 2016-03-08 19:05:53
9 333 9 2016-01-08 19:03:20
10 333 9 2016-01-08 19:02:15
11 111 10 2016-02-08 20:08:50
I want to add a column to my dataframe that flags the duplicated transactions (the dollar amount for the same customer ID should be the same, and the time between the transactions should be less than 2 minutes). Please consider the first transaction to be "normal".
ID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 No
1 111 3 2016-01-29 19:03:55 No
2 111 1 2016-01-08 19:05:50 Yes
3 111 10 2016-01-08 20:08:50 No
4 222 25 2016-01-08 19:04:50 No
5 222 8 2016-02-08 19:04:50 No
6 222 25 2016-03-08 19:04:50 No
7 333 9 2016-01-08 19:04:50 Yes
8 333 20 2016-03-08 19:05:53 No
9 333 9 2016-01-08 19:03:20 Yes
10 333 9 2016-01-08 19:02:15 No
11 111 10 2016-02-08 20:08:50 No
IIUC, you can groupby and diff to check whether the difference between successive transactions is less than 120 seconds:
df['Duplicated?'] = (df.sort_values(['transactionDateTime'])
                       .groupby(['ID', 'Dollar'], sort=False)['transactionDateTime']
                       .diff()
                       .dt.total_seconds()
                       .lt(120))
df
ID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 False
1 111 3 2016-01-29 19:03:55 False
2 111 1 2016-01-08 19:05:50 True
3 111 100 2016-01-08 20:08:50 False
4 222 25 2016-01-08 19:04:50 False
5 222 8 2016-02-08 19:04:50 False
6 222 25 2016-03-08 19:04:50 False
7 333 9 2016-01-08 19:04:50 True
8 333 20 2016-03-08 19:05:53 False
9 333 9 2016-01-08 19:03:20 True
10 333 9 2016-01-08 19:02:15 False
11 111 100 2016-02-08 20:08:50 False
Note that your data isn't sorted, so you must sort it first to get a meaningful result.
You can use:
m=(DF.groupby('customerID')['transactionDateTime'].diff()/ np.timedelta64(1, 'm')).le(2)
DF['Duplicated?']=np.where((DF.Dollar.duplicated()&m),'Yes','No')
print(DF)
customerID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 No
1 111 3 2016-01-29 19:03:55 No
2 111 1 2016-01-08 19:05:50 Yes
3 111 100 2016-01-08 20:08:50 No
4 222 25 2016-01-08 19:04:50 No
5 222 8 2016-02-08 19:04:50 No
6 222 25 2016-03-08 19:04:50 No
7 333 9 2016-01-08 19:04:50 No
8 333 20 2016-03-08 19:05:53 No
9 333 9 2016-01-08 19:03:20 Yes
10 333 9 2016-01-08 19:02:15 Yes
11 111 100 2016-02-08 20:08:50 No
We can first mark the duplicate payments in your Dollar column, then mark, per customer, whether the difference is less than 2 minutes:
DF.sort_values(['customerID', 'transactionDateTime'], inplace=True)
m1 = DF.groupby('customerID', sort=False)['Dollar'].apply(lambda x: x.duplicated())
m2 = DF.groupby('customerID', sort=False)['transactionDateTime'].diff() <= pd.Timedelta(2, unit='minutes')
DF['Duplicated?'] = np.where(m1 & m2, 'Yes', 'No')
customerID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 No
1 111 1 2016-01-08 19:05:50 Yes
2 111 100 2016-01-08 20:08:50 No
3 111 3 2016-01-29 19:03:55 No
4 111 100 2016-02-08 20:08:50 No
5 222 25 2016-01-08 19:04:50 No
6 222 8 2016-02-08 19:04:50 No
7 222 25 2016-03-08 19:04:50 No
8 333 9 2016-01-08 19:02:15 No
9 333 9 2016-01-08 19:03:20 Yes
10 333 9 2016-01-08 19:04:50 Yes
11 333 20 2016-03-08 19:05:53 No
I build a pd.Timedelta(minutes=2) to compare against the diff():
m2 = pd.Timedelta(minutes=2)
DF['dup'] = DF.sort_values('transactionDateTime').groupby(['Dollar','ID']).transactionDateTime.diff().abs().le(m2).astype(int)
Out[272]:
Dollar ID transactionDateTime dup
0 1 111 2016-01-08 19:04:50 0
1 3 111 2016-01-29 19:03:55 0
2 1 111 2016-01-08 19:05:50 1
3 100 111 2016-01-08 20:08:50 0
4 25 222 2016-01-08 19:04:50 0
5 8 222 2016-02-08 19:04:50 0
6 25 222 2016-03-08 19:04:50 0
7 9 333 2016-01-08 19:04:50 1
8 20 333 2016-03-08 19:05:53 0
9 9 333 2016-01-08 19:03:20 1
10 9 333 2016-01-08 19:02:15 0
11 100 111 2016-02-08 20:08:50 0

Extract groups of consecutive values having greater than specified size

I am trying to find, within a dataframe, whether there are at least X consecutive operations (I already included a column "FILTER_OK" that indicates whether the row meets the criteria), and to extract that group of rows.
TRN TRN_DATE FILTER_OK
0 5153 04/04/2017 11:40:00 True
1 7542 04/04/2017 17:18:00 True
2 875 04/04/2017 20:08:00 True
3 74 05/04/2017 20:30:00 False
4 9652 06/04/2017 20:32:00 True
5 965 07/04/2017 12:52:00 True
6 752 10/04/2017 17:40:00 True
7 9541 10/04/2017 19:29:00 True
8 7452 11/04/2017 12:20:00 True
9 9651 12/04/2017 13:57:00 False
For this example, suppose I am looking for at least 4 consecutive operations.
OUTPUT DESIRED:
TRN TRN_DATE FILTER_OK
4 9652 06/04/2017 20:32:00 True
5 965 07/04/2017 12:52:00 True
6 752 10/04/2017 17:40:00 True
7 9541 10/04/2017 19:29:00 True
8 7452 11/04/2017 12:20:00 True
How can I subset the operations I need?
You may do this using cumsum, followed by groupby, and transform:
v = (~df.FILTER_OK).cumsum()
df[v.groupby(v).transform('size').ge(4) & df['FILTER_OK']]
TRN TRN_DATE FILTER_OK
4 9652 2017-06-04 20:32:00 True
5 965 2017-07-04 12:52:00 True
6 752 2017-10-04 17:40:00 True
7 9541 2017-10-04 19:29:00 True
8 7452 2017-11-04 12:20:00 True
Details
First, use cumsum to segregate rows into groups:
v = (~df.FILTER_OK).cumsum()
v
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 2
Name: FILTER_OK, dtype: int64
Next, find the size of each group, and then figure out what groups have at least X rows (in your case, 4):
v.groupby(v).transform('size')
0 3
1 3
2 3
3 6
4 6
5 6
6 6
7 6
8 6
9 1
Name: FILTER_OK, dtype: int64
v.groupby(v).transform('size').ge(4)
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 True
9 False
Name: FILTER_OK, dtype: bool
AND this mask with "FILTER_OK" to ensure we only take valid rows that fit the criteria.
v.groupby(v).transform('size').ge(4) & df['FILTER_OK']
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
8 True
9 False
Name: FILTER_OK, dtype: bool
Note that this will also pick up runs of consecutive False values:
s=df.FILTER_OK.astype(int).diff().ne(0).cumsum()
df[s.isin(s.value_counts().loc[lambda x : x>4].index)]
Out[784]:
TRN TRN_DATE FILTER_OK
4 9652 06/04/2017 20:32:00 True
5 965 07/04/2017 12:52:00 True
6 752 10/04/2017 17:40:00 True
7 9541 10/04/2017 19:29:00 True
8 7452 11/04/2017 12:20:00 True
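If only the True runs should be kept, the same row mask can be combined with the flag itself; a small sketch:
df[s.isin(s.value_counts().loc[lambda x : x>4].index) & df.FILTER_OK]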
One possible option is to use itertools.groupby, called on the source df.values.
An important difference of this method, compared to pd.groupby, is that a new group is created every time the grouping key changes.
So you can try the following code:
import pandas as pd
import itertools
# Source DataFrame
df = pd.DataFrame(data=[
[ 5153, '04/04/2017 11:40:00', True ], [ 7542, '04/04/2017 17:18:00', True ],
[ 875, '04/04/2017 20:08:00', True ], [ 74, '05/04/2017 20:30:00', False ],
[ 9652, '06/04/2017 20:32:00', True ], [ 965, '07/04/2017 12:52:00', True ],
[ 752, '10/04/2017 17:40:00', True ], [ 9541, '10/04/2017 19:29:00', True ],
[ 7452, '11/04/2017 12:20:00', True ], [ 9651, '12/04/2017 13:57:00', False ]],
columns=[ 'TRN', 'TRN_DATE', 'FILTER_OK' ])
# Work list
xx = []
# Collect groups for 'True' key with at least 5 members
for key, group in itertools.groupby(df.values, lambda x: x[2]):
    lst = list(group)
    if key and len(lst) >= 5:
        xx.extend(lst)
# Create result DataFrame with the same column names
df2 = pd.DataFrame(data=xx, columns=df.columns)
This is actually part of a "group by" operation (by the CRD column).
If there are two consecutive groups of rows (CRD 111 and 333), and the second group does not meet the condition (not 4 consecutive True), the first row of that group is still included (the bold line below) when it shouldn't be; a possible fix is sketched after the desired output.
CRD TRN TRN_DATE FILTER_OK
0 111 5153 04/04/2017 11:40:00 True
1 111 7542 04/04/2017 17:18:00 True
2 256 875 04/04/2017 20:08:00 True
3 365 74 05/04/2017 20:30:00 False
4 111 9652 06/04/2017 20:32:00 True
5 111 965 07/04/2017 12:52:00 True
6 111 752 10/04/2017 17:40:00 True
7 111 9541 10/04/2017 19:29:00 True
**8 333 7452 11/04/2017 12:20:00 True**
9 333 9651 12/04/2017 13:57:00 False
10 333 961 12/04/2017 13:57:00 False
11 333 871 12/04/2017 13:57:00 False
Actual output:
CRD TRN TRN_DATE FILTER_OK
4 111 9652 06/04/2017 20:32:00 True
5 111 965 07/04/2017 12:52:00 True
6 111 752 10/04/2017 17:40:00 True
7 111 9541 10/04/2017 19:29:00 True
**8 333 7452 11/04/2017 12:20:00 True**
Desired output:
CRD TRN TRN_DATE FILTER_OK
4 111 9652 06/04/2017 20:32:00 True
5 111 965 07/04/2017 12:52:00 True
6 111 752 10/04/2017 17:40:00 True
7 111 9541 10/04/2017 19:29:00 True
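A minimal sketch of one way to get the desired output (assuming the frame above is named df and includes the CRD column): start a new run whenever either FILTER_OK or CRD changes, so a run of True rows can never span two cards, then keep runs of at least 4 rows.
# Run ids reset when FILTER_OK flips or when CRD changes, so True runs
# cannot spill across two different cards.
blocks = ((df['FILTER_OK'] != df['FILTER_OK'].shift())
          | (df['CRD'] != df['CRD'].shift())).cumsum()
# Keep only True rows that belong to a run of at least 4 rows.
df[df['FILTER_OK'] & blocks.map(blocks.value_counts()).ge(4)]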

Pandas compare 2 dataframes by specific rows in all columns

I have the following Pandas dataframe of some raw numbers:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 10000)
col_raw_headers = ['07_08_19 #1','07_08_19 #2','07_08_19 #2.1','11_31_19 #1','11_31_19 #1.1','11_31_19 #1.3','12_15_20 #1','12_15_20 #2','12_15_20 #2.1','12_15_20 #2.2']
col_raw_trial_info = ['Quantity1','Quantity2','Quantity3','Quantity4','Quantity5','Quantity6','TimeStamp',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
cols_raw = [[1,75,9,7,-4,0.4,'07/08/2019 05:11'],[1,11,20,-17,12,0.8,'07/08/2019 10:54'],[2,0.9,17,102,56,0.6,'07/08/2019 21:04'],[1,70,4,75,0.8,0.4,'11/31/2019 11:15'],[2,60,74,41,-36,0.3,'11/31/2019 16:50'],[3,17,12,-89,30,0.1,'11/31/2019 21:33'],[1,6,34,496,-84,0.5,'12/15/2020 01:36'],[1,3,43,12,-23,0.5,'12/15/2020 07:01'],[2,5,92,17,64,0.5,'12/15/2020 11:15'],[3,7,11,62,-11,0.5,'12/15/2020 21:45']]
both_values = [[1,2,3,4,8,4,3,8,7],[6,5,3,7,3,23,27,3,11],[65,3,6,78,9,2,45,6,7],[4,3,6,8,3,5,66,32,84],[2,3,11,55,3,7,33,65,34],[22,1,6,32,5,6,4,3,898],[1,6,3,2,6,55,22,6,23],[34,37,46,918,0,37,91,12,68],[51,20,1,34,12,59,78,6,101],[12,71,34,94,1,73,46,51,21]]
processed_cols = ['c_1trial','14_1','14_2','8_1','8_2','8_3','28_1','24_1','24_2','24_3']
df_raw = pd.DataFrame(zip(*cols_raw))
df_temp = pd.DataFrame(zip(*both_values))
df_raw = pd.concat([df_raw,df_temp])
df_raw.columns=col_raw_headers
df_raw.insert(0,'Tr_id',col_raw_trial_info)
df_raw.reset_index(drop=True,inplace=True)
It looks like this:
Tr_id 07_08_19 #1 07_08_19 #2 07_08_19 #2.1 11_31_19 #1 11_31_19 #1.1 11_31_19 #1.3 12_15_20 #1 12_15_20 #2 12_15_20 #2.1 12_15_20 #2.2
0 Quantity1 1 1 2 1 2 3 1 1 2 3
1 Quantity2 75 11 0.9 70 60 17 6 3 5 7
2 Quantity3 9 20 17 4 74 12 34 43 92 11
3 Quantity4 7 -17 102 75 41 -89 496 12 17 62
4 Quantity5 -4 12 56 0.8 -36 30 -84 -23 64 -11
5 Quantity6 0.4 0.8 0.6 0.4 0.3 0.1 0.5 0.5 0.5 0.5
6 TimeStamp 07/08/2019 05:11 07/08/2019 10:54 07/08/2019 21:04 11/31/2019 11:15 11/31/2019 16:50 11/31/2019 21:33 12/15/2020 01:36 12/15/2020 07:01 12/15/2020 11:15 12/15/2020 21:45
7 NaN 1 6 65 4 2 22 1 34 51 12
8 NaN 2 5 3 3 3 1 6 37 20 71
9 NaN 3 3 6 6 11 6 3 46 1 34
10 NaN 4 7 78 8 55 32 2 918 34 94
11 NaN 8 3 9 3 3 5 6 0 12 1
12 NaN 4 23 2 5 7 6 55 37 59 73
13 NaN 3 27 45 66 33 4 22 91 78 46
14 NaN 8 3 6 32 65 3 6 12 6 51
15 NaN 7 11 7 84 34 898 23 68 101 21
I have a separate dataframe of a processed version of these numbers where:
some of the header rows from above have been deleted,
the column names have been changed
Here is the second dataframe:
df_processed = pd.DataFrame(zip(*both_values),columns=processed_cols)
df_processed = df_processed[[3,4,9,7,0,2,1,6,8,5]]
8_1 8_2 24_3 24_1 c_1trial 14_2 14_1 28_1 24_2 8_3
0 4 2 12 34 1 65 6 1 51 22
1 3 3 71 37 2 3 5 6 20 1
2 6 11 34 46 3 6 3 3 1 6
3 8 55 94 918 4 78 7 2 34 32
4 3 3 1 0 8 9 3 6 12 5
5 5 7 73 37 4 2 23 55 59 6
6 66 33 46 91 3 45 27 22 78 4
7 32 65 51 12 8 6 3 6 6 3
8 84 34 21 68 7 7 11 23 101 898
Common parts of each dataframe:
For each column, rows 8 onwards of the raw dataframe are the same as row 1 onwards from the processed dataframe. The order of columns in both dataframes is not the same.
Output combination:
I am looking to compare rows 8-16 in columns 1-10 of the raw dataframe df_raw to the processed dataframe df_processed. If the columns match each other, then I would like to extract rows 1-7 of df_raw and the column header from df_processed.
Example:
The values in column c_1trial only match the values in rows 8-16 of column 07_08_19 #1. I would do this in 2 steps: (1) find some way to determine that these 2 columns match each other, (2) if 2 columns do match each other, then select the rows from the matching columns shown in the sample output.
Here is the output I am looking to get:
Tr_id 07_08_19 #1 07_08_19 #2 07_08_19 #2.1 11_31_19 #1 11_31_19 #1.1 11_31_19 #1.3 12_15_20 #1 12_15_20 #2 12_15_20 #2.1 12_15_20 #2.2
Quantity1 1 1 2 1 2 3 1 1 2 3
Quantity2 75 11 0.9 70 60 17 6 3 5 7
Quantity3 9 20 17 4 74 12 34 43 92 11
Proc_Name c_1trial 14_1 14_2 8_1 8_2 8_3 28_1 24_1 24_2 24_3
Quantity4 7 -17 102 75 41 -89 496 12 17 62
Quantity5 -4 12 56 0.8 -36 30 -84 -23 64 -11
Quantity6 0.4 0.8 0.6 0.4 0.3 0.1 0.5 0.5 0.5 0.5
TimeStamp 07/08/2019 05:11 07/08/2019 10:54 07/08/2019 21:04 11/31/2019 11:15 11/31/2019 16:50 11/31/2019 21:33 12/15/2020 01:36 12/15/2020 07:01 12/15/2020 11:15 12/15/2020 21:45
My attempts are giving trouble:
print (df_raw.iloc[7:,1:] == df_processed).all(axis=1)
gives
ValueError: Can only compare identically-labeled DataFrame objects
and
print (df_raw.ix[7:].values == df_processed.values) #gives False
gives
False
The problem with my second attempt is that I am not selecting .all(axis=1). When I make a comparison I want to do this across all rows of every column, not just one row.
Question:
Is there a way to select out the output I showed above from these 2 dataframes?
Does this look like the output you're looking for?
Raw dataframe df:
Tr_id 07_08_19 07_08_19.1 07_08_19.2 11_31_19 11_31_19.1
0 Quantity1 1 1 2 1 2
1 Quantity2 75 11 0.9 70 60
2 Quantity3 9 20 17 4 74
3 Quantity4 7 -17 102 75 41
4 Quantity5 -4 12 56 0.8 -36
5 Quantity6 0.4 0.8 0.6 0.4 0.3
6 TimeStamp 07/08/2019 07/08/2019 07/08/2019 11/31/2019 11/31/2019
7 NaN 1 6 65 4 2
8 NaN 2 5 3 3 3
9 NaN 3 3 6 6 11
10 NaN 4 7 78 8 55
11 NaN 8 3 9 3 3
12 NaN 4 23 2 5 7
13 NaN 3 27 45 66 33
14 NaN 8 3 6 32 65
15 NaN 7 11 7 84 34
11_31_19.2 12_15_20 12_15_20.1 12_15_20.2 12_15_20.3
0 3 1 1 2 3
1 17 6 3 5 7
2 12 34 43 92 11
3 -89 496 12 17 62
4 30 -84 -23 64 -11
5 0.1 0.5 0.5 0.5 0.5
6 11/31/2019 12/15/2020 12/15/2020 12/15/2020 12/15/2020
7 22 1 34 51 12
8 1 6 37 20 71
9 6 3 46 1 34
10 32 2 918 34 94
11 5 6 0 12 1
12 6 55 37 59 73
13 4 22 91 78 46
14 3 6 12 6 51
15 898 23 68 101 21
Processed dataframe dfp:
8_1 8_2 24_3 24_1 c_1trial 14_2 14_1 28_1 24_2 8_3
0 4 2 12 34 1 65 6 1 51 22
1 3 3 71 37 2 3 5 6 20 1
2 6 11 34 46 3 6 3 3 1 6
3 8 55 94 918 4 78 7 2 34 32
4 3 3 1 0 8 9 3 6 12 5
5 5 7 73 37 4 2 23 55 59 6
6 66 33 46 91 3 45 27 22 78 4
7 32 65 51 12 8 6 3 6 6 3
8 84 34 21 68 7 7 11 23 101 898
Code:
df = pd.read_csv('raw_df.csv') # raw dataframe
dfp = pd.read_csv('processed_df.csv') # processed dataframe
dfr = df.drop('Tr_id', axis=1)
x = pd.DataFrame()
for col_raw in dfr.columns:
    for col_p in dfp.columns:
        if (dfr.tail(9).astype(int)[col_raw] == dfp[col_p]).all():
            series = dfr[col_raw].head(7).tolist()
            series.append(col_raw)
            x[col_p] = series
x = pd.concat([df['Tr_id'].head(7), x], axis=1)
Output:
Tr_id c_1trial 14_1 14_2 8_1 8_2
0 Quantity1 1 1 2 1 2
1 Quantity2 75 11 0.9 70 60
2 Quantity3 9 20 17 4 74
3 Quantity4 7 -17 102 75 41
4 Quantity5 -4 12 56 0.8 -36
5 Quantity6 0.4 0.8 0.6 0.4 0.3
6 TimeStamp 07/08/2019 07/08/2019 07/08/2019 11/31/2019 11/31/2019
7 NaN 07_08_19 07_08_19.1 07_08_19.2 11_31_19 11_31_19.1
8_3 28_1 24_1 24_2 24_3
0 3 1 1 2 3
1 17 6 3 5 7
2 12 34 43 92 11
3 -89 496 12 17 62
4 30 -84 -23 64 -11
5 0.1 0.5 0.5 0.5 0.5
6 11/31/2019 12/15/2020 12/15/2020 12/15/2020 12/15/2020
7 11_31_19.2 12_15_20 12_15_20.1 12_15_20.2 12_15_20.3
I think the code could be more concise but maybe this does the job.
An alternative solution, using the DataFrame.isin() method:
In [171]: df1
Out[171]:
a b c
0 1 1 3
1 0 2 4
2 4 2 2
3 0 3 3
4 0 4 4
In [172]: df2
Out[172]:
a b c
0 0 3 3
1 1 1 1
2 0 3 4
3 4 2 3
4 0 4 4
In [173]: common = pd.merge(df1, df2)
In [174]: common
Out[174]:
a b c
0 0 3 3
1 0 4 4
In [175]: df1[df1.isin(common.to_dict('list')).all(axis=1)]
Out[175]:
a b c
3 0 3 3
4 0 4 4
Or, if you want to subtract the second data set from the first one, i.e. the Pandas equivalent of SQL's:
select col1, .., colN from tableA
minus
select col1, .., colN from tableB
in Pandas:
In [176]: df1[~df1.isin(common.to_dict('list')).all(axis=1)]
Out[176]:
a b c
0 1 1 3
1 0 2 4
2 4 2 2
I came up with this using loops. It is very disappointing:
holder = []
for randm, pp in enumerate(list(df_processed)):
    list1 = df_processed[pp].tolist()
    for car, rr in enumerate(list(df_raw)):
        list2 = df_raw.loc[7:, rr].tolist()
        if list1 == list2:
            holder.append([rr, pp])
df_intermediate = pd.DataFrame(holder,columns=['A','B'])
df_c = df_raw.loc[:6,df_intermediate.iloc[:,0].tolist()]
df_c.loc[df_c.shape[0]] = df_intermediate.iloc[:,1].tolist()
df_c.insert(0,list(df_raw)[0],df_raw[list(df_raw)[0]])
df_c.iloc[-1,0]='Proc_Name'
df_c = df_c.reindex([0,1,2]+[7]+[3,4,5,6]).reset_index(drop=True)
Output:
Tr_id 11_31_19 #1 11_31_19 #1.1 12_15_20 #2.2 12_15_20 #2 07_08_19 #1 07_08_19 #2.1 07_08_19 #2 12_15_20 #1 12_15_20 #2.1 11_31_19 #1.3
0 Quantity1 1 2 3 1 1 2 1 1 2 3
1 Quantity2 70 60 7 3 75 0.9 11 6 5 17
2 Quantity3 4 74 11 43 9 17 20 34 92 12
3 Proc_Name 8_1 8_2 24_3 24_1 c_1trial 14_2 14_1 28_1 24_2 8_3
4 Quantity4 75 41 62 12 7 102 -17 496 17 -89
5 Quantity5 0.8 -36 -11 -23 -4 56 12 -84 64 30
6 Quantity6 0.4 0.3 0.5 0.5 0.4 0.6 0.8 0.5 0.5 0.1
7 TimeStamp 11/31/2019 11:15 11/31/2019 16:50 12/15/2020 21:45 12/15/2020 07:01 07/08/2019 05:11 07/08/2019 21:04 07/08/2019 10:54 12/15/2020 01:36 12/15/2020 11:15 11/31/2019 21:33
The order of the columns is different from what I required, but that is a minor problem.
The real problem with this approach is the use of loops.
I wish there were a better way to do this using built-in Pandas functionality. If you have a better solution, please post it. Thank you.
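A possible loop-free alternative (a sketch, not from the original thread, assuming the integer block occupies rows 7 onward of df_raw as in the example): turn each processed column into a tuple and match the raw columns against it with a dictionary lookup.
# Hash each processed column as a tuple, then look each raw column's
# integer block up in that dictionary.
lookup = {tuple(df_processed[c]): c for c in df_processed.columns}
tails = df_raw.iloc[7:, 1:].astype(int)
matched = {c: lookup.get(tuple(tails[c])) for c in tails.columns}  # None if no match
out = df_raw.iloc[:7].copy()
out.loc[7] = ['Proc_Name'] + [matched[c] for c in out.columns[1:]]
# Move the Proc_Name row up under Quantity3, as in the desired output.
out = out.reindex([0, 1, 2, 7, 3, 4, 5, 6]).reset_index(drop=True)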
