I assign the eligible index value to column A and then forward-fill it with df.ffill().
Now I want to use the value of column A as an index into price and assign the looked-up value to the expected column.
I tried df['expected'] = df['price'][df['A']] but it doesn't work.
Input:
import pandas as pd
import numpy as np
d={'trade_date':['2021-08-10','2021-08-11','2021-08-12','2021-08-13','2021-08-14','2021-08-15','2021-08-16','2021-08-17','2021-08-18','2021-08-19','2021-08-20',],'price':[2,12,8,10,11,18,7,19,9,8,12],'cond':[True,False,True,False,True,False,True,False,True,True,True]}
df = pd.DataFrame(d)
df.index=pd.to_datetime(df.trade_date)
df['A']=df.index.where(df['cond'])
df['A']=df['A'].ffill()
df.to_clipboard()
df
Expected result:
trade_date price cond A expected
2021/8/10 2 TRUE 2021/8/10 2
2021/8/11 12 FALSE 2021/8/10 2
2021/8/12 8 TRUE 2021/8/12 8
2021/8/13 10 FALSE 2021/8/12 8
2021/8/14 11 TRUE 2021/8/14 11
2021/8/15 18 FALSE 2021/8/14 11
2021/8/16 7 TRUE 2021/8/16 7
2021/8/17 19 FALSE 2021/8/16 7
2021/8/18 9 TRUE 2021/8/18 9
2021/8/19 8 TRUE 2021/8/19 8
2021/8/20 12 TRUE 2021/8/20 12
Try this: map looks up each value of A in the price Series (which is indexed by trade_date):
df['expected'] = df['A'].map(df['price'])
print(df)
price cond A expected
trade_date
2021-08-10 2 True 2021-08-10 2
2021-08-11 12 False 2021-08-10 2
2021-08-12 8 True 2021-08-12 8
2021-08-13 10 False 2021-08-12 8
2021-08-14 11 True 2021-08-14 11
2021-08-15 18 False 2021-08-14 11
2021-08-16 7 True 2021-08-16 7
2021-08-17 19 False 2021-08-16 7
2021-08-18 9 True 2021-08-18 9
2021-08-19 8 True 2021-08-19 8
2021-08-20 12 True 2021-08-20 12
Alternatively, you could use groupby and transform:
df.assign(expected=df.groupby(['A'])['price'].transform('first'))
-------------------------------------------------
price cond A expected
trade_date
2021-08-10 2 True 2021-08-10 2
2021-08-11 12 False 2021-08-10 2
2021-08-12 8 True 2021-08-12 8
2021-08-13 10 False 2021-08-12 8
2021-08-14 11 True 2021-08-14 11
2021-08-15 18 False 2021-08-14 11
2021-08-16 7 True 2021-08-16 7
2021-08-17 19 False 2021-08-16 7
2021-08-18 9 True 2021-08-18 9
2021-08-19 8 True 2021-08-19 8
2021-08-20 12 True 2021-08-20 12
-------------------------------------------------
This approach groups by A, takes the first value of price in each group, and assigns it to every row of the corresponding group.
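As for why the original attempt fails: df['price'][df['A']] does look up the right values, but the result is a Series indexed by the dates stored in A, so assigning it back to df misaligns with df's index. A minimal sketch of an equivalent fix that simply drops the index (this assumes A has no NaN, which holds here because cond is True on the first row; .map() is the safer choice otherwise):
# look up each date in A, then discard the index so the assignment is positional
df['expected'] = df['price'].loc[df['A']].to_numpy()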
There are duplicated transactions in a bank dataframe (DF). ID holds the customer IDs. A duplicated transaction is a multi-swipe, where a vendor accidentally charges a customer's card multiple times within a short time span (2 minutes here).
DF = pd.DataFrame({'ID': ['111', '111', '111','111', '222', '222', '222', '333', '333', '333', '333','111'],'Dollar': [1,3,1,10, 25, 8, 25,9,20, 9, 9,10],'transactionDateTime': ['2016-01-08 19:04:50', '2016-01-29 19:03:55', '2016-01-08 19:05:50', '2016-01-08 20:08:50', '2016-01-08 19:04:50', '2016-02-08 19:04:50', '2016-03-08 19:04:50', '2016-01-08 19:04:50', '2016-03-08 19:05:53', '2016-01-08 19:03:20', '2016-01-08 19:02:15', '2016-02-08 20:08:50']})
DF['transactionDateTime'] = pd.to_datetime(DF['transactionDateTime'])
ID Dollar transactionDateTime
0 111 1 2016-01-08 19:04:50
1 111 3 2016-01-29 19:03:55
2 111 1 2016-01-08 19:05:50
3 111 10 2016-01-08 20:08:50
4 222 25 2016-01-08 19:04:50
5 222 8 2016-02-08 19:04:50
6 222 25 2016-03-08 19:04:50
7 333 9 2016-01-08 19:04:50
8 333 20 2016-03-08 19:05:53
9 333 9 2016-01-08 19:03:20
10 333 9 2016-01-08 19:02:15
11 111 10 2016-02-08 20:08:50
I want to add a column to my dataframe which flags the duplicated transactions (the dollar amount for the same customer ID should be the same, and the transaction date-times should be less than 2 minutes apart). Please consider the first transaction to be "normal".
ID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 No
1 111 3 2016-01-29 19:03:55 No
2 111 1 2016-01-08 19:05:50 Yes
3 111 10 2016-01-08 20:08:50 No
4 222 25 2016-01-08 19:04:50 No
5 222 8 2016-02-08 19:04:50 No
6 222 25 2016-03-08 19:04:50 No
7 333 9 2016-01-08 19:04:50 Yes
8 333 20 2016-03-08 19:05:53 No
9 333 9 2016-01-08 19:03:20 Yes
10 333 9 2016-01-08 19:02:15 No
11 111 10 2016-02-08 20:08:50 No
IIUC, you can groupby and diff to check whether the difference between successive transactions is less than 120 seconds (the first transaction in each group has a NaT diff, which compares as False, so it stays "normal"):
DF['Duplicated?'] = (DF.sort_values(['transactionDateTime'])
                       .groupby(['ID', 'Dollar'], sort=False)['transactionDateTime']
                       .diff()
                       .dt.total_seconds()
                       .lt(120))
DF
ID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 False
1 111 3 2016-01-29 19:03:55 False
2 111 1 2016-01-08 19:05:50 True
3 111 10 2016-01-08 20:08:50 False
4 222 25 2016-01-08 19:04:50 False
5 222 8 2016-02-08 19:04:50 False
6 222 25 2016-03-08 19:04:50 False
7 333 9 2016-01-08 19:04:50 True
8 333 20 2016-03-08 19:05:53 False
9 333 9 2016-01-08 19:03:20 True
10 333 9 2016-01-08 19:02:15 False
11 111 10 2016-02-08 20:08:50 False
Note that your data isn't sorted, so you must sort it first to get a meaningful result.
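To see why, a quick check of the unsorted within-group diffs (a sketch):
# for ID 333 the rows appear latest-first, so the diffs come out negative;
# lt(120) would then flag the two earlier swipes and miss the latest one
DF.groupby(['ID', 'Dollar'])['transactionDateTime'].diff().dt.total_seconds()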
You can use:
import numpy as np

m = (DF.groupby('ID')['transactionDateTime'].diff() / np.timedelta64(1, 'm')).le(2)
# note: Dollar.duplicated() flags repeated amounts frame-wide, not per customer
DF['Duplicated?'] = np.where((DF.Dollar.duplicated() & m), 'Yes', 'No')
print(DF)
ID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 No
1 111 3 2016-01-29 19:03:55 No
2 111 1 2016-01-08 19:05:50 Yes
3 111 10 2016-01-08 20:08:50 No
4 222 25 2016-01-08 19:04:50 No
5 222 8 2016-02-08 19:04:50 No
6 222 25 2016-03-08 19:04:50 No
7 333 9 2016-01-08 19:04:50 No
8 333 20 2016-03-08 19:05:53 No
9 333 9 2016-01-08 19:03:20 Yes
10 333 9 2016-01-08 19:02:15 Yes
11 111 10 2016-02-08 20:08:50 No
We can first mark the duplicate payments per customer in your Dollar column, then mark per customer whether the time difference is less than 2 minutes:
DF.sort_values(['ID', 'transactionDateTime'], inplace=True)
DF.reset_index(drop=True, inplace=True)  # match the row numbering shown below
m1 = DF.groupby('ID', sort=False)['Dollar'].apply(lambda x: x.duplicated())
m2 = DF.groupby('ID', sort=False)['transactionDateTime'].diff() <= pd.Timedelta(2, unit='minutes')
DF['Duplicated?'] = np.where(m1 & m2, 'Yes', 'No')
ID Dollar transactionDateTime Duplicated?
0 111 1 2016-01-08 19:04:50 No
1 111 1 2016-01-08 19:05:50 Yes
2 111 10 2016-01-08 20:08:50 No
3 111 3 2016-01-29 19:03:55 No
4 111 10 2016-02-08 20:08:50 No
5 222 25 2016-01-08 19:04:50 No
6 222 8 2016-02-08 19:04:50 No
7 222 25 2016-03-08 19:04:50 No
8 333 9 2016-01-08 19:02:15 No
9 333 9 2016-01-08 19:03:20 Yes
10 333 9 2016-01-08 19:04:50 Yes
11 333 20 2016-03-08 19:05:53 No
I make a pd.Timedelta(minutes=2) to compare the diff() against (the rows are sorted by transactionDateTime first, so the .abs() is just an extra safeguard):
m2 = pd.Timedelta(minutes=2)
DF['dup'] = (DF.sort_values('transactionDateTime')
               .groupby(['Dollar', 'ID']).transactionDateTime
               .diff().abs().le(m2).astype(int))
Out[272]:
Dollar ID transactionDateTime dup
0 1 111 2016-01-08 19:04:50 0
1 3 111 2016-01-29 19:03:55 0
2 1 111 2016-01-08 19:05:50 1
3 10 111 2016-01-08 20:08:50 0
4 25 222 2016-01-08 19:04:50 0
5 8 222 2016-02-08 19:04:50 0
6 25 222 2016-03-08 19:04:50 0
7 9 333 2016-01-08 19:04:50 1
8 20 333 2016-03-08 19:05:53 0
9 9 333 2016-01-08 19:03:20 1
10 9 333 2016-01-08 19:02:15 0
11 10 111 2016-02-08 20:08:50 0
I am trying to find whether a dataframe contains at least X consecutive operations that meet a criterion (I already included a column "FILTER_OK" that indicates whether each row meets it), and to extract that group of rows.
TRN TRN_DATE FILTER_OK
0 5153 04/04/2017 11:40:00 True
1 7542 04/04/2017 17:18:00 True
2 875 04/04/2017 20:08:00 True
3 74 05/04/2017 20:30:00 False
4 9652 06/04/2017 20:32:00 True
5 965 07/04/2017 12:52:00 True
6 752 10/04/2017 17:40:00 True
7 9541 10/04/2017 19:29:00 True
8 7452 11/04/2017 12:20:00 True
9 9651 12/04/2017 13:57:00 False
For this example, suppose I am looking for 4 consecutive operations.
OUTPUT DESIRED:
TRN TRN_DATE FILTER_OK
4 9652 06/04/2017 20:32:00 True
5 965 07/04/2017 12:52:00 True
6 752 10/04/2017 17:40:00 True
7 9541 10/04/2017 19:29:00 True
8 7452 11/04/2017 12:20:00 True
How can I subset the operations I need?
You may do this using cumsum, followed by groupby, and transform:
v = (~df.FILTER_OK).cumsum()
df[v.groupby(v).transform('size').ge(4) & df['FILTER_OK']]
TRN TRN_DATE FILTER_OK
4 9652 2017-06-04 20:32:00 True
5 965 2017-07-04 12:52:00 True
6 752 2017-10-04 17:40:00 True
7 9541 2017-10-04 19:29:00 True
8 7452 2017-11-04 12:20:00 True
Details
First, use cumsum to segregate rows into groups:
v = (~df.FILTER_OK).cumsum()
v
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 2
Name: FILTER_OK, dtype: int64
Next, find the size of each group, and then figure out what groups have at least X rows (in your case, 4):
v.groupby(v).transform('size')
0 3
1 3
2 3
3 6
4 6
5 6
6 6
7 6
8 6
9 1
Name: FILTER_OK, dtype: int64
v.groupby(v).transform('size').ge(4)
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 True
9 False
Name: FILTER_OK, dtype: bool
Finally, AND this mask with "FILTER_OK" to ensure we only take valid rows that fit the criteria:
v.groupby(v).transform('size').ge(4) & df['FILTER_OK']
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
8 True
9 False
Name: FILTER_OK, dtype: bool
Note that this approach will also pick up runs of 4 or more consecutive False values, since it only looks at run length, not at the value itself:
s = df.FILTER_OK.astype(int).diff().ne(0).cumsum()
df[s.isin(s.value_counts().loc[lambda x: x >= 4].index)]
Out[784]:
TRN TRN_DATE FILTER_OK
4 9652 06/04/2017 20:32:00 True
5 965 07/04/2017 12:52:00 True
6 752 10/04/2017 17:40:00 True
7 9541 10/04/2017 19:29:00 True
8 7452 11/04/2017 12:20:00 True
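To keep only the True runs, one possibility (a sketch, not part of the original answer) is to AND the mask with FILTER_OK itself:
df[s.isin(s.value_counts().loc[lambda x: x >= 4].index) & df.FILTER_OK]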
One possible option is to use itertools.groupby, called on the source df.values. An important difference of this method, compared to pandas groupby, is that a new group is created every time the grouping key changes, even if the same key occurred earlier. So you can try the following code:
import pandas as pd
import itertools
# Source DataFrame
df = pd.DataFrame(data=[
[ 5153, '04/04/2017 11:40:00', True ], [ 7542, '04/04/2017 17:18:00', True ],
[ 875, '04/04/2017 20:08:00', True ], [ 74, '05/04/2017 20:30:00', False ],
[ 9652, '06/04/2017 20:32:00', True ], [ 965, '07/04/2017 12:52:00', True ],
[ 752, '10/04/2017 17:40:00', True ], [ 9541, '10/04/2017 19:29:00', True ],
[ 7452, '11/04/2017 12:20:00', True ], [ 9651, '12/04/2017 13:57:00', False ]],
columns=[ 'TRN', 'TRN_DATE', 'FILTER_OK' ])
# Work list
xx = []
# Collect groups for 'True' key with at least 4 members
# (the question asks for at least 4 consecutive operations)
for key, group in itertools.groupby(df.values, lambda x: x[2]):
    lst = list(group)
    if key and len(lst) >= 4:
        xx.extend(lst)
# Create result DataFrame with the same column names
df2 = pd.DataFrame(data=xx, columns=df.columns)
This is actually part of a "group by" operation (by the CRD column).
If there are two consecutive groups of rows (CRD 111 and 333), and the second group does not meet the condition (fewer than 4 consecutive True rows), the first row of that group is included (the bold line) when it shouldn't be:
CRD TRN TRN_DATE FILTER_OK
0 111 5153 04/04/2017 11:40:00 True
1 111 7542 04/04/2017 17:18:00 True
2 256 875 04/04/2017 20:08:00 True
3 365 74 05/04/2017 20:30:00 False
4 111 9652 06/04/2017 20:32:00 True
5 111 965 07/04/2017 12:52:00 True
6 111 752 10/04/2017 17:40:00 True
7 111 9541 10/04/2017 19:29:00 True
**8 333 7452 11/04/2017 12:20:00 True**
9 333 9651 12/04/2017 13:57:00 False
10 333 961 12/04/2017 13:57:00 False
11 333 871 12/04/2017 13:57:00 False
Actual output:
CRD TRN TRN_DATE FILTER_OK
4 111 9652 06/04/2017 20:32:00 True
5 111 965 07/04/2017 12:52:00 True
6 111 752 10/04/2017 17:40:00 True
7 111 9541 10/04/2017 19:29:00 True
**8 333 7452 11/04/2017 12:20:00 True**
Desired output:
CRD TRN TRN_DATE FILTER_OK
4 111 9652 06/04/2017 20:32:00 True
5 111 965 07/04/2017 12:52:00 True
6 111 752 10/04/2017 17:40:00 True
7 111 9541 10/04/2017 19:29:00 True
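One way to handle this (a sketch, not from the answers above) is to restart the run counter whenever either FILTER_OK goes False or CRD changes, so that a run can never span two cards:
# increment the run id at every False row and at every change of CRD
v = ((~df.FILTER_OK) | (df.CRD != df.CRD.shift())).cumsum()
df[v.groupby(v).transform('size').ge(4) & df['FILTER_OK']]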
I have the following Pandas dataframe of some raw numbers:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 10000)
col_raw_headers = ['07_08_19 #1','07_08_19 #2','07_08_19 #2.1','11_31_19 #1','11_31_19 #1.1','11_31_19 #1.3','12_15_20 #1','12_15_20 #2','12_15_20 #2.1','12_15_20 #2.2']
col_raw_trial_info = ['Quantity1','Quantity2','Quantity3','Quantity4','Quantity5','Quantity6','TimeStamp',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
cols_raw = [[1,75,9,7,-4,0.4,'07/08/2019 05:11'],[1,11,20,-17,12,0.8,'07/08/2019 10:54'],[2,0.9,17,102,56,0.6,'07/08/2019 21:04'],[1,70,4,75,0.8,0.4,'11/31/2019 11:15'],[2,60,74,41,-36,0.3,'11/31/2019 16:50'],[3,17,12,-89,30,0.1,'11/31/2019 21:33'],[1,6,34,496,-84,0.5,'12/15/2020 01:36'],[1,3,43,12,-23,0.5,'12/15/2020 07:01'],[2,5,92,17,64,0.5,'12/15/2020 11:15'],[3,7,11,62,-11,0.5,'12/15/2020 21:45']]
both_values = [[1,2,3,4,8,4,3,8,7],[6,5,3,7,3,23,27,3,11],[65,3,6,78,9,2,45,6,7],[4,3,6,8,3,5,66,32,84],[2,3,11,55,3,7,33,65,34],[22,1,6,32,5,6,4,3,898],[1,6,3,2,6,55,22,6,23],[34,37,46,918,0,37,91,12,68],[51,20,1,34,12,59,78,6,101],[12,71,34,94,1,73,46,51,21]]
processed_cols = ['c_1trial','14_1','14_2','8_1','8_2','8_3','28_1','24_1','24_2','24_3']
df_raw = pd.DataFrame(zip(*cols_raw))
df_temp = pd.DataFrame(zip(*both_values))
df_raw = pd.concat([df_raw,df_temp])
df_raw.columns=col_raw_headers
df_raw.insert(0,'Tr_id',col_raw_trial_info)
df_raw.reset_index(drop=True,inplace=True)
It looks like this:
Tr_id 07_08_19 #1 07_08_19 #2 07_08_19 #2.1 11_31_19 #1 11_31_19 #1.1 11_31_19 #1.3 12_15_20 #1 12_15_20 #2 12_15_20 #2.1 12_15_20 #2.2
0 Quantity1 1 1 2 1 2 3 1 1 2 3
1 Quantity2 75 11 0.9 70 60 17 6 3 5 7
2 Quantity3 9 20 17 4 74 12 34 43 92 11
3 Quantity4 7 -17 102 75 41 -89 496 12 17 62
4 Quantity5 -4 12 56 0.8 -36 30 -84 -23 64 -11
5 Quantity6 0.4 0.8 0.6 0.4 0.3 0.1 0.5 0.5 0.5 0.5
6 TimeStamp 07/08/2019 05:11 07/08/2019 10:54 07/08/2019 21:04 11/31/2019 11:15 11/31/2019 16:50 11/31/2019 21:33 12/15/2020 01:36 12/15/2020 07:01 12/15/2020 11:15 12/15/2020 21:45
7 NaN 1 6 65 4 2 22 1 34 51 12
8 NaN 2 5 3 3 3 1 6 37 20 71
9 NaN 3 3 6 6 11 6 3 46 1 34
10 NaN 4 7 78 8 55 32 2 918 34 94
11 NaN 8 3 9 3 3 5 6 0 12 1
12 NaN 4 23 2 5 7 6 55 37 59 73
13 NaN 3 27 45 66 33 4 22 91 78 46
14 NaN 8 3 6 32 65 3 6 12 6 51
15 NaN 7 11 7 84 34 898 23 68 101 21
I have a separate dataframe of a processed version of these numbers where:
some of the header rows from above have been deleted,
the column names have been changed
Here is the second dataframe:
df_processed = pd.DataFrame(zip(*both_values),columns=processed_cols)
df_processed = df_processed.iloc[:, [3,4,9,7,0,2,1,6,8,5]]
8_1 8_2 24_3 24_1 c_1trial 14_2 14_1 28_1 24_2 8_3
0 4 2 12 34 1 65 6 1 51 22
1 3 3 71 37 2 3 5 6 20 1
2 6 11 34 46 3 6 3 3 1 6
3 8 55 94 918 4 78 7 2 34 32
4 3 3 1 0 8 9 3 6 12 5
5 5 7 73 37 4 2 23 55 59 6
6 66 33 46 91 3 45 27 22 78 4
7 32 65 51 12 8 6 3 6 6 3
8 84 34 21 68 7 7 11 23 101 898
Common parts of each dataframe:
For each column, rows 8 onwards of the raw dataframe are the same as row 1 onwards from the processed dataframe. The order of columns in both dataframes is not the same.
Output combination:
I am looking to compare rows 8-16 in columns 1-10 of the raw dataframe df_raw to the processed dataframe df_processed. If two columns match each other, then I would like to extract rows 1-7 of df_raw together with the column header from df_processed.
Example:
the values in column c_1trial only match the values in rows 8-16 of column 07_08_19 #1. I would take 2 steps: (1) find some way to determine that these 2 columns match each other; (2) if 2 columns do match each other, select rows from the matching columns as in the sample output.
Here is the output I am looking to get:
Tr_id 07_08_19 #1 07_08_19 #2 07_08_19 #2.1 11_31_19 #1 11_31_19 #1.1 11_31_19 #1.3 12_15_20 #1 12_15_20 #2 12_15_20 #2.1 12_15_20 #2.2
Quantity1 1 1 2 1 2 3 1 1 2 3
Quantity2 75 11 0.9 70 60 17 6 3 5 7
Quantity3 9 20 17 4 74 12 34 43 92 11
Proc_Name c_1trial 14_1 14_2 8_1 8_2 8_3 28_1 24_1 24_2 24_3
Quantity4 7 -17 102 75 41 -89 496 12 17 62
Quantity5 -4 12 56 0.8 -36 30 -84 -23 64 -11
Quantity6 0.4 0.8 0.6 0.4 0.3 0.1 0.5 0.5 0.5 0.5
TimeStamp 07/08/2019 05:11 07/08/2019 10:54 07/08/2019 21:04 11/31/2019 11:15 11/31/2019 16:50 11/31/2019 21:33 12/15/2020 01:36 12/15/2020 07:01 12/15/2020 11:15 12/15/2020 21:45
My attempts run into trouble:
print (df_raw.iloc[7:,1:] == df_processed).all(axis=1)
gives
ValueError: Can only compare identically-labeled DataFrame objects
and
print (df_raw.ix[7:].values == df_processed.values) #gives False
gives
False
The problem with my second attempt is that I am not applying .all(axis=1). When I make a comparison, I want to do it across all rows of every column, not just one row.
Question:
Is there a way to select out the output I showed above from these 2 dataframes?
Does this look like the output you're looking for?
Raw dataframe df:
Tr_id 07_08_19 07_08_19.1 07_08_19.2 11_31_19 11_31_19.1
0 Quantity1 1 1 2 1 2
1 Quantity2 75 11 0.9 70 60
2 Quantity3 9 20 17 4 74
3 Quantity4 7 -17 102 75 41
4 Quantity5 -4 12 56 0.8 -36
5 Quantity6 0.4 0.8 0.6 0.4 0.3
6 TimeStamp 07/08/2019 07/08/2019 07/08/2019 11/31/2019 11/31/2019
7 NaN 1 6 65 4 2
8 NaN 2 5 3 3 3
9 NaN 3 3 6 6 11
10 NaN 4 7 78 8 55
11 NaN 8 3 9 3 3
12 NaN 4 23 2 5 7
13 NaN 3 27 45 66 33
14 NaN 8 3 6 32 65
15 NaN 7 11 7 84 34
11_31_19.2 12_15_20 12_15_20.1 12_15_20.2 12_15_20.3
0 3 1 1 2 3
1 17 6 3 5 7
2 12 34 43 92 11
3 -89 496 12 17 62
4 30 -84 -23 64 -11
5 0.1 0.5 0.5 0.5 0.5
6 11/31/2019 12/15/2020 12/15/2020 12/15/2020 12/15/2020
7 22 1 34 51 12
8 1 6 37 20 71
9 6 3 46 1 34
10 32 2 918 34 94
11 5 6 0 12 1
12 6 55 37 59 73
13 4 22 91 78 46
14 3 6 12 6 51
15 898 23 68 101 21
Processed dataframe dfp:
8_1 8_2 24_3 24_1 c_1trial 14_2 14_1 28_1 24_2 8_3
0 4 2 12 34 1 65 6 1 51 22
1 3 3 71 37 2 3 5 6 20 1
2 6 11 34 46 3 6 3 3 1 6
3 8 55 94 918 4 78 7 2 34 32
4 3 3 1 0 8 9 3 6 12 5
5 5 7 73 37 4 2 23 55 59 6
6 66 33 46 91 3 45 27 22 78 4
7 32 65 51 12 8 6 3 6 6 3
8 84 34 21 68 7 7 11 23 101 898
Code:
df = pd.read_csv('raw_df.csv') # raw dataframe
dfp = pd.read_csv('processed_df.csv') # processed dataframe
dfr = df.drop('Tr_id', axis=1)
x = pd.DataFrame()
for col_raw in dfr.columns:
    for col_p in dfp.columns:
        # compare on raw values: dfr.tail(9) keeps index 7-15 while dfp uses
        # 0-8, and comparing differently-indexed Series raises a ValueError
        if (dfr.tail(9).astype(int)[col_raw].to_numpy() == dfp[col_p].to_numpy()).all():
            series = dfr[col_raw].head(7).tolist()
            series.append(col_raw)
            x[col_p] = series
x = pd.concat([df['Tr_id'].head(7), x], axis=1)
Output:
Tr_id c_1trial 14_1 14_2 8_1 8_2
0 Quantity1 1 1 2 1 2
1 Quantity2 75 11 0.9 70 60
2 Quantity3 9 20 17 4 74
3 Quantity4 7 -17 102 75 41
4 Quantity5 -4 12 56 0.8 -36
5 Quantity6 0.4 0.8 0.6 0.4 0.3
6 TimeStamp 07/08/2019 07/08/2019 07/08/2019 11/31/2019 11/31/2019
7 NaN 07_08_19 07_08_19.1 07_08_19.2 11_31_19 11_31_19.1
8_3 28_1 24_1 24_2 24_3
0 3 1 1 2 3
1 17 6 3 5 7
2 12 34 43 92 11
3 -89 496 12 17 62
4 30 -84 -23 64 -11
5 0.1 0.5 0.5 0.5 0.5
6 11/31/2019 12/15/2020 12/15/2020 12/15/2020 12/15/2020
7 11_31_19.2 12_15_20 12_15_20.1 12_15_20.2 12_15_20.3
I think the code could be more concise, but maybe this does the job.
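For what it's worth, a possibly more concise variant (a sketch under the same assumptions, i.e. that the last 9 rows of each raw column equal exactly one processed column after casting to int; lookup and matches are hypothetical names, not from the code above) replaces the double loop with a dictionary keyed on each processed column's values:
# map each processed column's value tuple to its name, then match raw columns
lookup = {tuple(dfp[c]): c for c in dfp.columns}
matches = {c: lookup[tuple(dfr.tail(9).astype(int)[c])] for c in dfr.columns}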
An alternative solution, using the DataFrame.isin() method:
In [171]: df1
Out[171]:
a b c
0 1 1 3
1 0 2 4
2 4 2 2
3 0 3 3
4 0 4 4
In [172]: df2
Out[172]:
a b c
0 0 3 3
1 1 1 1
2 0 3 4
3 4 2 3
4 0 4 4
In [173]: common = pd.merge(df1, df2)
In [174]: common
Out[174]:
a b c
0 0 3 3
1 0 4 4
In [175]: df1[df1.isin(common.to_dict('list')).all(axis=1)]
Out[175]:
a b c
3 0 3 3
4 0 4 4
Or, if you want to subtract the second data set from the first one, i.e. the Pandas equivalent of SQL's:
select col1, .., colN from tableA
minus
select col1, .., colN from tableB
in Pandas:
In [176]: df1[~df1.isin(common.to_dict('list')).all(axis=1)]
Out[176]:
a b c
0 1 1 3
1 0 2 4
2 4 2 2
I came up with this using loops. It is very disappointing:
holder = []
for randm, pp in enumerate(list(df_processed)):
    list1 = df_processed[pp].tolist()
    for car, rr in enumerate(list(df_raw)):
        list2 = df_raw.loc[7:, rr].tolist()
        if list1 == list2:
            holder.append([rr, pp])
df_intermediate = pd.DataFrame(holder,columns=['A','B'])
df_c = df_raw.loc[:6,df_intermediate.iloc[:,0].tolist()]
df_c.loc[df_c.shape[0]] = df_intermediate.iloc[:,1].tolist()
df_c.insert(0,list(df_raw)[0],df_raw[list(df_raw)[0]])
df_c.iloc[-1,0]='Proc_Name'
df_c = df_c.reindex([0,1,2]+[7]+[3,4,5,6]).reset_index(drop=True)
Output:
Tr_id 11_31_19 #1 11_31_19 #1.1 12_15_20 #2.2 12_15_20 #2 07_08_19 #1 07_08_19 #2.1 07_08_19 #2 12_15_20 #1 12_15_20 #2.1 11_31_19 #1.3
0 Quantity1 1 2 3 1 1 2 1 1 2 3
1 Quantity2 70 60 7 3 75 0.9 11 6 5 17
2 Quantity3 4 74 11 43 9 17 20 34 92 12
3 Proc_Name 8_1 8_2 24_3 24_1 c_1trial 14_2 14_1 28_1 24_2 8_3
4 Quantity4 75 41 62 12 7 102 -17 496 17 -89
5 Quantity5 0.8 -36 -11 -23 -4 56 12 -84 64 30
6 Quantity6 0.4 0.3 0.5 0.5 0.4 0.6 0.8 0.5 0.5 0.1
7 TimeStamp 11/31/2019 11:15 11/31/2019 16:50 12/15/2020 21:45 12/15/2020 07:01 07/08/2019 05:11 07/08/2019 21:04 07/08/2019 10:54 12/15/2020 01:36 12/15/2020 11:15 11/31/2019 21:33
The order of the columns is different from what I required, but that is a minor problem.
The real problem with this approach is the use of loops.
I wish there were a better way to do this using some built-in Pandas functionality. If you have a better solution, please post it. Thank you.