Pandas custom groupby fill - python

I have this dataset:
menu alternative id varA varB varC
1 NaN A NaN NaN NaN
1 NaN A NaN NaN NaN
2 94 A 8.089481019 7.07639559 0.90627215
2 89 A 7.52310322 19.49894193 14.4562262
3 79 A 24.79634962 18.91163612 23.85341972
3 95 A 21.10990397 17.00630516 1.09875582
4 47 A 5.681766806 4.136047755 17.38880496
4 62 A 10.39459876 0.997853805 0.045331687
5 58 A 11.91790497 5.696799013 27.21424163
5 23 A 11.71107828 2.165751058 11.56534045
6 57 A 1.068603487 27.95362014 1.334049372
6 100 A 26.31848796 6.757305213 4.742282633
7 NaN A NaN NaN NaN
7 NaN A NaN NaN NaN
8 24 A 29.99608877 28.49057834 0.14073638
8 7 A 8.749041949 14.17745528 9.604565417
9 64 A 29.4316969 19.57593592 9.174503643
9 60 A 13.53995541 1.898164567 16.49089291
10 85 A 20.1394155 0.995839592 16.18638727
10 22 A 22.68625486 14.26052953 17.79707308
1 82 B 21.34428845 9.901326583 1.053134597
1 91 B 19.04689216 16.29217346 29.56962312
2 35 B 25.44168095 29.00407645 2.246459981
2 100 B 15.79687903 20.37920541 28.45071525
3 44 B 7.359501131 23.66924419 7.198215907
3 41 B 22.65272801 8.66227065 12.05186217
4 59 B 26.67565422 9.608511948 26.45016581
4 53 B 5.64870847 21.83063691 19.20105218
5 48 B 5.591317152 25.17616679 24.30522374
5 16 B 23.85069753 23.12154586 0.781450997
6 NaN B NaN NaN NaN
6 NaN B NaN NaN NaN
7 68 B 9.334935288 16.39114327 21.17696541
7 41 B 5.841577934 6.901223007 28.38116983
8 35 B 21.20288984 9.665414964 4.472546438
8 96 B 0.451299457 27.66880932 26.2120144
9 84 B 19.67310555 1.993071082 9.08442779
9 65 B 0.475983889 16.72261394 17.17122898
10 40 B 9.553130945 17.88616649 22.17570401
10 40 B 19.70487161 5.898428653 11.25844279
1 19 C 20.47792809 9.344376127 7.855311112
1 59 C 14.59141273 8.090534362 19.6972446
2 19 C 6.624345353 0.192145343 26.31356322
2 67 C 24.483236 6.718856437 25.75609679
3 67 C 27.6408808 24.91014602 25.90758755
3 30 C 26.52738124 10.78363589 4.873602089
4 14 C 3.776964641 21.16561036 24.03153234
4 46 C 16.53719818 23.86634958 25.61504006
5 NaN C NaN NaN NaN
5 NaN C NaN NaN NaN
6 58 C 28.1357636 15.89359176 0.567406646
6 28 C 0.708229201 12.20641988 0.309303591
7 17 C 4.319991082 21.29233667 3.516184987
7 8 C 24.09490443 9.507000131 14.93472971
8 85 C 19.99606403 21.61509867 0.161222766
8 5 C 6.056082264 25.35186187 5.375641692
9 24 C 19.83904205 24.54037422 11.08571464
9 13 C 4.388769239 7.928106767 4.279531285
10 78 C 13.67598922 5.3140143 15.2710129
10 13 C 12.27642791 16.04610858 1.815260029
As you can see, I have some null values which I need to fill, but in a somewhat custom manner. For every id and every menu, I need to fill the null values by randomly selecting the same menu (same menu number) from a different id where the values are non-null.
Example: menu 1 in id A has null values. I want to randomly select menu 1 from a different id with non-null values and copy the values from there; say, menu 1 in id B. For menu 7 in id A it might be menu 7 in id C, and so on.
It is similar to this question, but in my case the filling should happen within the same "subgroups", so to speak.
The final output should be something like this:
menu alternative id varA varB varC
1 82 A 21.34428845 9.901326583 1.053134597
1 91 A 19.04689216 16.29217346 29.56962312
2 94 A 8.089481019 7.07639559 0.90627215
2 89 A 7.52310322 19.49894193 14.4562262
3 79 A 24.79634962 18.91163612 23.85341972
3 95 A 21.10990397 17.00630516 1.09875582
4 47 A 5.681766806 4.136047755 17.38880496
4 62 A 10.39459876 0.997853805 0.045331687
5 58 A 11.91790497 5.696799013 27.21424163
5 23 A 11.71107828 2.165751058 11.56534045
6 57 A 1.068603487 27.95362014 1.334049372
6 100 A 26.31848796 6.757305213 4.742282633
7 17 A 4.319991082 21.29233667 3.516184987
7 8 A 24.09490443 9.507000131 14.93472971
8 24 A 29.99608877 28.49057834 0.14073638
8 7 A 8.749041949 14.17745528 9.604565417
9 64 A 29.4316969 19.57593592 9.174503643
9 60 A 13.53995541 1.898164567 16.49089291
10 85 A 20.1394155 0.995839592 16.18638727
10 22 A 22.68625486 14.26052953 17.79707308
1 82 B 21.34428845 9.901326583 1.053134597
1 91 B 19.04689216 16.29217346 29.56962312
2 35 B 25.44168095 29.00407645 2.246459981
2 100 B 15.79687903 20.37920541 28.45071525
3 44 B 7.359501131 23.66924419 7.198215907
3 41 B 22.65272801 8.66227065 12.05186217
4 59 B 26.67565422 9.608511948 26.45016581
4 53 B 5.64870847 21.83063691 19.20105218
5 48 B 5.591317152 25.17616679 24.30522374
5 16 B 23.85069753 23.12154586 0.781450997
6 57 B 1.068603487 27.95362014 1.334049372
6 100 B 26.31848796 6.757305213 4.742282633
7 68 B 9.334935288 16.39114327 21.17696541
7 41 B 5.841577934 6.901223007 28.38116983
8 35 B 21.20288984 9.665414964 4.472546438
8 96 B 0.451299457 27.66880932 26.2120144
9 84 B 19.67310555 1.993071082 9.08442779
9 65 B 0.475983889 16.72261394 17.17122898
10 40 B 9.553130945 17.88616649 22.17570401
10 40 B 19.70487161 5.898428653 11.25844279
1 19 C 20.47792809 9.344376127 7.855311112
1 59 C 14.59141273 8.090534362 19.6972446
2 19 C 6.624345353 0.192145343 26.31356322
2 67 C 24.483236 6.718856437 25.75609679
3 67 C 27.6408808 24.91014602 25.90758755
3 30 C 26.52738124 10.78363589 4.873602089
4 14 C 3.776964641 21.16561036 24.03153234
4 46 C 16.53719818 23.86634958 25.61504006
5 48 C 5.591317152 25.17616679 24.30522374
5 16 C 23.85069753 23.12154586 0.781450997
6 58 C 28.1357636 15.89359176 0.567406646
6 28 C 0.708229201 12.20641988 0.309303591
7 17 C 4.319991082 21.29233667 3.516184987
7 8 C 24.09490443 9.507000131 14.93472971
8 85 C 19.99606403 21.61509867 0.161222766
8 5 C 6.056082264 25.35186187 5.375641692
9 24 C 19.83904205 24.54037422 11.08571464
9 13 C 4.388769239 7.928106767 4.279531285
10 78 C 13.67598922 5.3140143 15.2710129
10 13 C 12.27642791 16.04610858 1.815260029
Any guidance would be appreciated. Maybe there is even some groupby/apply logic which could assist with this.

You can run fillna() row-wise in apply(), then fill with a random sample from the dataframe filtered by your conditions:
df.apply(lambda row: row.fillna(
    # sample one complete row with the same menu but a different id
    df[(df['menu'] == row['menu']) & (df['id'] != row['id'])].dropna().sample(n=1).iloc[0]),
    axis=1)
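Note that the apply above fills each row independently, so the two NaN rows of a (menu, id) pair may be filled from different donors. If, as in the expected output, both rows of a pair should come from one randomly chosen donor id, a group-level variant could look like this sketch (assuming the data is in df with the columns shown, and that every (menu, id) group has the same number of rows):
import numpy as np

value_cols = ['alternative', 'varA', 'varB', 'varC']

def fill_group(g, df):
    # leave groups that already have values untouched
    if not g[value_cols].isna().all().all():
        return g
    # donor rows: same menu, a different id, fully non-null
    donors = df[(df['menu'] == g['menu'].iat[0]) & (df['id'] != g['id'].iat[0])].dropna()
    # pick one donor id at random and copy its rows over
    donor = donors[donors['id'] == np.random.choice(donors['id'].unique())]
    g[value_cols] = donor[value_cols].to_numpy()
    return g

df = df.groupby(['id', 'menu'], group_keys=False).apply(lambda g: fill_group(g, df))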

Related

average of one wrt another or averageifs in python

I have a pandas df as displayed. I would like to calculate an "Avg Rate by DC by Brand" column, similar to AVERAGEIF in Excel.
I have tried methods like groupby with mean(), but that does not give correct results.
Your question is not clear but you may be looking for:
df.groupby(['DC','Brand'])['Rate'].mean()
AVERAGEIF in Excel returns a column the same size as your original data, so I think you're looking for groupby().transform():
# Sample DF
Brand Rate
0 A 45
1 B 100
2 C 28
3 A 92
4 B 2
5 C 79
6 A 48
7 B 97
8 C 72
9 D 14
10 D 16
11 D 64
12 E 85
13 E 22
df['Avg Rate by Brand'] = df.groupby('Brand')['Rate'].transform('mean')
print(df)
Result:
Brand Rate Avg Rate by Brand
0 A 45 61.666667
1 B 100 66.333333
2 C 28 59.666667
3 A 92 61.666667
4 B 2 66.333333
5 C 79 59.666667
6 A 48 61.666667
7 B 97 66.333333
8 C 72 59.666667
9 D 14 31.333333
10 D 16 31.333333
11 D 64 31.333333
12 E 85 53.500000
13 E 22 53.500000
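If you also need the aggregated table itself, an equivalent two-step formulation (sketched with the same sample columns) is to aggregate once and map the means back onto the rows:
# aggregate per Brand, then broadcast the means back to row level
avg = df.groupby('Brand')['Rate'].mean()
df['Avg Rate by Brand'] = df['Brand'].map(avg)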

Labeling by period

my dataset
name day value
A 7 88
A 15 101
A 21 121
A 29 56
B 21 131
B 30 78
B 35 102
C 8 80
C 16 101
...
I am trying to plot the values for these days, but there are too many unique day numbers to label them all.
I would like to label them consistently.
Is there a way to simplify the labeling by cutting the days into 7-day bins (weeks)?
For example, days 1-7 = week 1, days 8-14 = week 2, and so on.
output what I want
name day value week
A 7 88 1
A 15 101 3
A 21 121 3
A 29 56 5
B 21 131 3
B 30 78 5
B 35 102 5
C 8 80 2
C 16 101 3
thank you for reading
Subtract 1, then use integer division by 7, and finally add 1:
df['week'] = (df['day'] - 1) // 7 + 1
print (df)
name day value week
0 A 7 88 1
1 A 15 101 3
2 A 21 121 3
3 A 29 56 5
4 B 21 131 3
5 B 30 78 5
6 B 35 102 5
7 C 8 80 2
8 C 16 101 3
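An equivalent formulation, if you find ceiling division more readable (a sketch assuming days are numbered from 1):
import numpy as np

# days 1-7 -> week 1, days 8-14 -> week 2, ...
df['week'] = np.ceil(df['day'] / 7).astype(int)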

Merge dataframes including extreme values

I have 2 data frames, df1 and df2:
df1
Out[66]:
A B
0 1 11
1 1 2
2 1 32
3 1 42
4 1 54
5 1 66
6 2 16
7 2 23
8 3 13
9 3 24
10 3 35
11 3 46
12 3 51
13 4 12
14 4 28
15 4 39
16 4 49
df2
Out[80]:
B
0 32
1 42
2 13
3 24
4 35
5 39
6 49
I want to merge the dataframes while also including, for each group in column A, the values immediately before and/or after the matched values. This is an example of the desired outcome:
df3
Out[93]:
A B
0 1 2
1 1 32
2 1 42
3 1 54
4 3 13
5 3 24
6 3 35
7 3 46
8 4 28
9 4 39
10 4 49
I'm trying to use merge, but that only keeps the portion of the dataframes that coincides. Does someone have an idea how to deal with this? Thanks!
Here's one way to do it using merge with indicator, groupby, and rolling:
df1[df1.merge(df2, on='B', how='left', indicator='Ind').eval('Found = Ind == "both"')
    # a centered window of 3 flags every matched row plus its immediate neighbors
    .groupby('A')['Found']
    .apply(lambda x: x.rolling(3, center=True, min_periods=2).max()).astype(bool)]
Output:
A B
1 1 2
2 1 32
3 1 42
4 1 54
8 3 13
9 3 24
10 3 35
11 3 46
14 4 28
15 4 39
16 4 49
(pd.concat([df1.groupby('A').min().reset_index(),
            pd.merge(df1, df2, on="B"),
            df1.groupby('A').max().reset_index()])
   .reset_index(drop=True)
   .drop_duplicates()
   .sort_values(['A', 'B']))
A B
0 1 2
4 1 32
5 1 42
1 2 16
2 3 13
7 3 24
8 3 35
3 4 12
9 4 39
10 4 49
Breaking down each part
#Get Minimum
df1.groupby('A').min().reset_index()
# Merge on B
pd.merge(df1,df2, on="B")
# Get Maximum
df1.groupby('A').max().reset_index()
# Reset the Index and drop duplicated rows since there may be similarities between the Merge and Min/Max. Sort values by 'A' then by 'B'
.reset_index(drop=True).drop_duplicates().sort_values(['A','B'])

insert dataframe into a dataframe - Python/Pandas

The question is pretty self-explanatory: how would you insert a dataframe with a couple of values into a bigger dataframe at a given point (between indexes 10 and 11)? This means .append can't be used.
You can use concat with df1 sliced by loc:
import numpy as np
import pandas as pd

np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(100, size=(5,6)), columns=list('ABCDEF'))
print (df1)
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
df2 = pd.DataFrame({'A':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9],
                    'D':[1,3,5],
                    'E':[5,3,6],
                    'F':[7,4,3]})
print (df2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
#insert df2 between index positions 4 and 5; loc slicing is inclusive,
#so the second slice starts at 5 to avoid repeating row 4
print (pd.concat([df1.loc[:4], df2, df1.loc[5:]], ignore_index=True))
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
5 1 4 7 1 5 7
6 2 5 8 3 3 4
7 3 6 9 5 6 3
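The same idea generalizes to a small position-based helper for the "between indexes 10 and 11" case from the question (insert_rows is a hypothetical name, not a pandas API):
def insert_rows(df, other, pos):
    # insert `other` before positional row `pos` of `df`
    return pd.concat([df.iloc[:pos], other, df.iloc[pos:]], ignore_index=True)

# e.g. place a small frame after positional row 10 of a bigger one:
# result = insert_rows(bigger, df2, 11)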

Pandas compare 2 dataframes by specific rows in all columns

I have the following Pandas dataframe of some raw numbers:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 10000)
col_raw_headers = ['07_08_19 #1','07_08_19 #2','07_08_19 #2.1','11_31_19 #1','11_31_19 #1.1','11_31_19 #1.3','12_15_20 #1','12_15_20 #2','12_15_20 #2.1','12_15_20 #2.2']
col_raw_trial_info = ['Quantity1','Quantity2','Quantity3','Quantity4','Quantity5','Quantity6','TimeStamp',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
cols_raw = [[1,75,9,7,-4,0.4,'07/08/2019 05:11'],[1,11,20,-17,12,0.8,'07/08/2019 10:54'],[2,0.9,17,102,56,0.6,'07/08/2019 21:04'],[1,70,4,75,0.8,0.4,'11/31/2019 11:15'],[2,60,74,41,-36,0.3,'11/31/2019 16:50'],[3,17,12,-89,30,0.1,'11/31/2019 21:33'],[1,6,34,496,-84,0.5,'12/15/2020 01:36'],[1,3,43,12,-23,0.5,'12/15/2020 07:01'],[2,5,92,17,64,0.5,'12/15/2020 11:15'],[3,7,11,62,-11,0.5,'12/15/2020 21:45']]
both_values = [[1,2,3,4,8,4,3,8,7],[6,5,3,7,3,23,27,3,11],[65,3,6,78,9,2,45,6,7],[4,3,6,8,3,5,66,32,84],[2,3,11,55,3,7,33,65,34],[22,1,6,32,5,6,4,3,898],[1,6,3,2,6,55,22,6,23],[34,37,46,918,0,37,91,12,68],[51,20,1,34,12,59,78,6,101],[12,71,34,94,1,73,46,51,21]]
processed_cols = ['c_1trial','14_1','14_2','8_1','8_2','8_3','28_1','24_1','24_2','24_3']
df_raw = pd.DataFrame(zip(*cols_raw))
df_temp = pd.DataFrame(zip(*both_values))
df_raw = pd.concat([df_raw,df_temp])
df_raw.columns=col_raw_headers
df_raw.insert(0,'Tr_id',col_raw_trial_info)
df_raw.reset_index(drop=True,inplace=True)
It looks like this:
Tr_id 07_08_19 #1 07_08_19 #2 07_08_19 #2.1 11_31_19 #1 11_31_19 #1.1 11_31_19 #1.3 12_15_20 #1 12_15_20 #2 12_15_20 #2.1 12_15_20 #2.2
0 Quantity1 1 1 2 1 2 3 1 1 2 3
1 Quantity2 75 11 0.9 70 60 17 6 3 5 7
2 Quantity3 9 20 17 4 74 12 34 43 92 11
3 Quantity4 7 -17 102 75 41 -89 496 12 17 62
4 Quantity5 -4 12 56 0.8 -36 30 -84 -23 64 -11
5 Quantity6 0.4 0.8 0.6 0.4 0.3 0.1 0.5 0.5 0.5 0.5
6 TimeStamp 07/08/2019 05:11 07/08/2019 10:54 07/08/2019 21:04 11/31/2019 11:15 11/31/2019 16:50 11/31/2019 21:33 12/15/2020 01:36 12/15/2020 07:01 12/15/2020 11:15 12/15/2020 21:45
7 NaN 1 6 65 4 2 22 1 34 51 12
8 NaN 2 5 3 3 3 1 6 37 20 71
9 NaN 3 3 6 6 11 6 3 46 1 34
10 NaN 4 7 78 8 55 32 2 918 34 94
11 NaN 8 3 9 3 3 5 6 0 12 1
12 NaN 4 23 2 5 7 6 55 37 59 73
13 NaN 3 27 45 66 33 4 22 91 78 46
14 NaN 8 3 6 32 65 3 6 12 6 51
15 NaN 7 11 7 84 34 898 23 68 101 21
I have a separate dataframe of a processed version of these numbers where:
some of the header rows from above have been deleted,
the column names have been changed
Here is the second dataframe:
df_processed = pd.DataFrame(zip(*both_values),columns=processed_cols)
df_processed = df_processed.iloc[:, [3,4,9,7,0,2,1,6,8,5]]  # reorder columns by position
8_1 8_2 24_3 24_1 c_1trial 14_2 14_1 28_1 24_2 8_3
0 4 2 12 34 1 65 6 1 51 22
1 3 3 71 37 2 3 5 6 20 1
2 6 11 34 46 3 6 3 3 1 6
3 8 55 94 918 4 78 7 2 34 32
4 3 3 1 0 8 9 3 6 12 5
5 5 7 73 37 4 2 23 55 59 6
6 66 33 46 91 3 45 27 22 78 4
7 32 65 51 12 8 6 3 6 6 3
8 84 34 21 68 7 7 11 23 101 898
Common parts of each dataframe:
For each column, rows 8 onwards of the raw dataframe are the same as row 1 onwards from the processed dataframe. The order of columns in both dataframes is not the same.
Output combination:
I am looking to compare rows 8-16 in columns 1-10 of the raw dataframe df_raw to the processed dataframe df_processed. If the columns match each other, then I would like to extract rows 1-7 of df_raw and the column header from df_processed.
Example:
The values in column c_1trial match only the values in rows 8-16 of column 07_08_19 #1. I need 2 steps: (1) find some way to determine that these 2 columns match each other; (2) if 2 columns do match each other, select the rows from the matching columns shown in the sample output.
Here is the output I am looking to get:
Tr_id 07_08_19 #1 07_08_19 #2 07_08_19 #2.1 11_31_19 #1 11_31_19 #1.1 11_31_19 #1.3 12_15_20 #1 12_15_20 #2 12_15_20 #2.1 12_15_20 #2.2
Quantity1 1 1 2 1 2 3 1 1 2 3
Quantity2 75 11 0.9 70 60 17 6 3 5 7
Quantity3 9 20 17 4 74 12 34 43 92 11
Proc_Name c_1trial 14_1 14_2 8_1 8_2 8_3 28_1 24_1 24_2 24_3
Quantity4 7 -17 102 75 41 -89 496 12 17 62
Quantity5 -4 12 56 0.8 -36 30 -84 -23 64 -11
Quantity6 0.4 0.8 0.6 0.4 0.3 0.1 0.5 0.5 0.5 0.5
TimeStamp 07/08/2019 05:11 07/08/2019 10:54 07/08/2019 21:04 11/31/2019 11:15 11/31/2019 16:50 11/31/2019 21:33 12/15/2020 01:36 12/15/2020 07:01 12/15/2020 11:15 12/15/2020 21:45
My attempts are giving trouble:
print (df_raw.iloc[7:,1:] == df_processed).all(axis=1)
gives
ValueError: Can only compare identically-labeled DataFrame objects
and
print (df_raw.ix[7:].values == df_processed.values) #gives False
gives
False
The problem with my second attempt is that I am not applying .all(axis=1). When I make a comparison, I want it to hold across all rows of every column, not just one row.
Question:
Is there a way to select out the output I showed above from these 2 dataframes?
Does this look like the output you're looking for?
Raw dataframe df:
Tr_id 07_08_19 07_08_19.1 07_08_19.2 11_31_19 11_31_19.1
0 Quantity1 1 1 2 1 2
1 Quantity2 75 11 0.9 70 60
2 Quantity3 9 20 17 4 74
3 Quantity4 7 -17 102 75 41
4 Quantity5 -4 12 56 0.8 -36
5 Quantity6 0.4 0.8 0.6 0.4 0.3
6 TimeStamp 07/08/2019 07/08/2019 07/08/2019 11/31/2019 11/31/2019
7 NaN 1 6 65 4 2
8 NaN 2 5 3 3 3
9 NaN 3 3 6 6 11
10 NaN 4 7 78 8 55
11 NaN 8 3 9 3 3
12 NaN 4 23 2 5 7
13 NaN 3 27 45 66 33
14 NaN 8 3 6 32 65
15 NaN 7 11 7 84 34
11_31_19.2 12_15_20 12_15_20.1 12_15_20.2 12_15_20.3
0 3 1 1 2 3
1 17 6 3 5 7
2 12 34 43 92 11
3 -89 496 12 17 62
4 30 -84 -23 64 -11
5 0.1 0.5 0.5 0.5 0.5
6 11/31/2019 12/15/2020 12/15/2020 12/15/2020 12/15/2020
7 22 1 34 51 12
8 1 6 37 20 71
9 6 3 46 1 34
10 32 2 918 34 94
11 5 6 0 12 1
12 6 55 37 59 73
13 4 22 91 78 46
14 3 6 12 6 51
15 898 23 68 101 21
Processed dataframe dfp:
8_1 8_2 24_3 24_1 c_1trial 14_2 14_1 28_1 24_2 8_3
0 4 2 12 34 1 65 6 1 51 22
1 3 3 71 37 2 3 5 6 20 1
2 6 11 34 46 3 6 3 3 1 6
3 8 55 94 918 4 78 7 2 34 32
4 3 3 1 0 8 9 3 6 12 5
5 5 7 73 37 4 2 23 55 59 6
6 66 33 46 91 3 45 27 22 78 4
7 32 65 51 12 8 6 3 6 6 3
8 84 34 21 68 7 7 11 23 101 898
Code:
df = pd.read_csv('raw_df.csv') # raw dataframe
dfp = pd.read_csv('processed_df.csv') # processed dataframe
dfr = df.drop('Tr_id', axis=1)
x = pd.DataFrame()
for col_raw in dfr.columns:
    for col_p in dfp.columns:
        # reset the index so the raw tail (index 7-15) aligns positionally with dfp (index 0-8)
        if (dfr[col_raw].tail(9).reset_index(drop=True).astype(int) == dfp[col_p]).all():
            series = dfr[col_raw].head(7).tolist()
            series.append(col_raw)
            x[col_p] = series
x = pd.concat([df['Tr_id'].head(7), x], axis=1)
Output:
Tr_id c_1trial 14_1 14_2 8_1 8_2
0 Quantity1 1 1 2 1 2
1 Quantity2 75 11 0.9 70 60
2 Quantity3 9 20 17 4 74
3 Quantity4 7 -17 102 75 41
4 Quantity5 -4 12 56 0.8 -36
5 Quantity6 0.4 0.8 0.6 0.4 0.3
6 TimeStamp 07/08/2019 07/08/2019 07/08/2019 11/31/2019 11/31/2019
7 NaN 07_08_19 07_08_19.1 07_08_19.2 11_31_19 11_31_19.1
8_3 28_1 24_1 24_2 24_3
0 3 1 1 2 3
1 17 6 3 5 7
2 12 34 43 92 11
3 -89 496 12 17 62
4 30 -84 -23 64 -11
5 0.1 0.5 0.5 0.5 0.5
6 11/31/2019 12/15/2020 12/15/2020 12/15/2020 12/15/2020
7 11_31_19.2 12_15_20 12_15_20.1 12_15_20.2 12_15_20.3
I think the code could be more concise but maybe this does the job.
An alternative solution, using the DataFrame.isin() method:
In [171]: df1
Out[171]:
a b c
0 1 1 3
1 0 2 4
2 4 2 2
3 0 3 3
4 0 4 4
In [172]: df2
Out[172]:
a b c
0 0 3 3
1 1 1 1
2 0 3 4
3 4 2 3
4 0 4 4
In [173]: common = pd.merge(df1, df2)
In [174]: common
Out[174]:
a b c
0 0 3 3
1 0 4 4
In [175]: df1[df1.isin(common.to_dict('list')).all(axis=1)]
Out[175]:
a b c
3 0 3 3
4 0 4 4
Or if you want to subtract the second data set from the first one, i.e. the Pandas equivalent of SQL's:
select col1, .., colN from tableA
minus
select col1, .., colN from tableB
in Pandas:
In [176]: df1[~df1.isin(common.to_dict('list')).all(axis=1)]
Out[176]:
a b c
0 1 1 3
1 0 2 4
2 4 2 2
I came up with this using loops. It is very disappointing:
holder = []
for randm, pp in enumerate(list(df_processed)):
    list1 = df_processed[pp].tolist()
    for car, rr in enumerate(list(df_raw)):
        list2 = df_raw.loc[7:, rr].tolist()
        if list1 == list2:
            holder.append([rr, pp])
df_intermediate = pd.DataFrame(holder,columns=['A','B'])
df_c = df_raw.loc[:6,df_intermediate.iloc[:,0].tolist()]
df_c.loc[df_c.shape[0]] = df_intermediate.iloc[:,1].tolist()
df_c.insert(0,list(df_raw)[0],df_raw[list(df_raw)[0]])
df_c.iloc[-1,0]='Proc_Name'
df_c = df_c.reindex([0,1,2]+[7]+[3,4,5,6]).reset_index(drop=True)
Output:
Tr_id 11_31_19 #1 11_31_19 #1.1 12_15_20 #2.2 12_15_20 #2 07_08_19 #1 07_08_19 #2.1 07_08_19 #2 12_15_20 #1 12_15_20 #2.1 11_31_19 #1.3
0 Quantity1 1 2 3 1 1 2 1 1 2 3
1 Quantity2 70 60 7 3 75 0.9 11 6 5 17
2 Quantity3 4 74 11 43 9 17 20 34 92 12
3 Proc_Name 8_1 8_2 24_3 24_1 c_1trial 14_2 14_1 28_1 24_2 8_3
4 Quantity4 75 41 62 12 7 102 -17 496 17 -89
5 Quantity5 0.8 -36 -11 -23 -4 56 12 -84 64 30
6 Quantity6 0.4 0.3 0.5 0.5 0.4 0.6 0.8 0.5 0.5 0.1
7 TimeStamp 11/31/2019 11:15 11/31/2019 16:50 12/15/2020 21:45 12/15/2020 07:01 07/08/2019 05:11 07/08/2019 21:04 07/08/2019 10:54 12/15/2020 01:36 12/15/2020 11:15 11/31/2019 21:33
The order of the columns is different from what I required, but that is a minor problem.
The real problem with this approach is the use of loops.
I wish there were a better way to do this using some built-in Pandas functionality. If you have a better solution, please post it. Thank you.
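For what it is worth, a loop-free sketch (assuming df_raw and df_processed as constructed in the question, with the value block in the last 9 raw rows) can broadcast-compare every raw column against every processed column at once:
import numpy as np

raw_cols = df_raw.columns[1:]                                  # skip Tr_id
raw_tail = df_raw.loc[7:, raw_cols].astype(float).to_numpy()   # the 9 value rows
proc = df_processed.astype(float).to_numpy()
# match[i, j] is True when raw column i equals processed column j
match = (raw_tail[:, :, None] == proc[:, None, :]).all(axis=0)
pairs = [(raw_cols[i], df_processed.columns[j]) for i, j in zip(*np.where(match))]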
