Compare custom column values and print difference in pandas DataFrame - python

I have two DataFrames:
df1 = pd.DataFrame({'MS': [1000, 1005, 1007, np.nan, 1010, 1012, 1020],
                    'Command': ['RD', 'RD', 'WR', '---', 'RD', 'RD', 'WR'],
                    'Data1': [100, 110, 120, np.nan, 130, 140, 150],
                    'Data2': ['A', 'A', 'B', '--', 'A', 'B', 'B'],
                    'Data3': [1, 0, 0, np.nan, 1, 1, 0]})
df2 = pd.DataFrame({'MS': [1001, 1006, 1010, np.nan, 1003, 1015, 1020, 1030],
                    'Command': ['WR', 'RD', 'WR', '---', 'RD', 'RD', 'WR', 'RD'],
                    'Data1': [120, 110, 120, np.nan, 140, 130, 150, 110],
                    'Data2': ['B', 'A', 'B', '--', 'B', 'A', 'B', 'A'],
                    'Data3': [0, 0, 1, np.nan, 1, 0, 0, 0]})
I want to compare every row in df1 with df2, except the 'MS' column, where 'MS' is a time in milliseconds. Both DataFrames have identical columns. The 'MS' column might contain NaN, in which case that row needs to be ignored.
From the comparison, I want to print:
Matching rows in df1 and df2, one below the other, with a new column 'Diff' holding the difference between their 'MS' values; in the example above, row 3 in df1 matches row 1 of df2, so print:
MS Diff Command Data1 Data2 Data3
0 1007 NaN WR 120 B 0
1 1001 6 WR 120 B 0
Print all unmatched rows in df1 and df2
The compare function should be generic enough to accept an argument listing the columns of choice, and compare only the values in those columns to decide match or no-match. For example, on each iteration I may pass a different column list:
itr1_comp_col = ['Command', 'Data1', 'Data3']
itr2_comp_col = ['Command', 'Data2', 'Data3']
On each iteration, it should compare only the columns the user chose.
So far I have not been able to produce any satisfactory code; I am a beginner with pandas.
I have tried grouping by the 'Command' column and concatenating two identical groups while dropping duplicates, as discussed in this thread.
I have also looped through every row manually and compared the values, which is hopelessly inefficient, as the data is very large: several million entries.
Please suggest an efficient way to handle the above case. Thanks in advance.

I will answer my own question, following up on what @Ankur said in his comments.
Even though this doesn't print matching rows one below the other, it partially fulfils the requirement.
Referring to this page, merge can be used to find the difference between DataFrames. In particular, the how= argument does the work. Below is the function:
def find_diff(df1: pd.DataFrame, df2: pd.DataFrame, col_itr):
    # outer merge keeps rows from both frames; non-key columns that clash
    # (here 'MS') get the _x/_y suffixes
    res = pd.merge(df1, df2, on=col_itr, how='outer')
    res['Diff'] = res['MS_x'] - res['MS_y']
    print(res)
Usage:
import pandas as pd
import numpy as np

d1 = {'MS': [1000, 1005, 1007, np.nan, 1010, 1012, 1020],
      'Command': ['RD', 'RD', 'WR', '-', 'RD', 'RD', 'WR'],
      'Data1': [100, 110, 120, np.nan, 130, 140, 150],
      'Data2': ['A', 'A', 'B', '-', 'A', 'B', 'B'],
      'Data3': [1, 0, 0, np.nan, 1, 1, 0]}
d2 = {'MS': [1001, 1006, 1010, np.nan, 1003, 1015, 1020, 1030],
      'Command': ['WR', 'RD', 'WR', '-', 'RD', 'RD', 'WR', 'RD'],
      'Data1': [120, 110, 120, np.nan, 140, 130, 150, 110],
      'Data2': ['B', 'A', 'B', '-', 'B', 'A', 'B', 'A'],
      'Data3': [0, 0, 1, np.nan, 1, 0, 0, 0]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

itr1_comp_col = ['Command', 'Data1', 'Data2', 'Data3']
itr2_comp_col = ['Command', 'Data2', 'Data3']
find_diff(df1, df2, itr1_comp_col)
find_diff(df1, df2, itr2_comp_col)
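The function above doesn't report unmatched rows. A sketch that extends it using merge's indicator= flag (the '_merge' column is standard pandas; the function name, the suffixes, and the dropna step for NaN 'MS' rows are my own additions, not from the original answer):

def find_diff_full(df1: pd.DataFrame, df2: pd.DataFrame, comp_cols):
    # ignore rows whose 'MS' is NaN, as the question requires
    left = df1.dropna(subset=['MS'])
    right = df2.dropna(subset=['MS'])
    # outer merge keeps every row; '_merge' records which side a row came from
    res = pd.merge(left, right, on=comp_cols, how='outer',
                   suffixes=('_df1', '_df2'), indicator=True)
    res['Diff'] = res['MS_df1'] - res['MS_df2']
    print(res[res['_merge'] == 'both'])        # matched rows, with Diff
    print(res[res['_merge'] == 'left_only'])   # rows only in df1
    print(res[res['_merge'] == 'right_only'])  # rows only in df2
    return res

find_diff_full(df1, df2, itr2_comp_col)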

Related

How to create conditional group-by with pandas?

Suppose I have a dataframe like this:
data = [['A', 'HIGH', 120, 200],
        ['A', 'MID', 350, 200],
        ['B', 'HIGH', 130, 100],
        ['B', 'HIGH', 70, 100],
        ['A', 'MID', 130, 200]]
df = pd.DataFrame(data, columns=['Category', 'Range', 'Total', 'Avg'])
Now, I want to create a group-by such that when the category is A, it groups by Category and Range, while when it is B, it groups only by Category.
Is that possible?
Thanks!
Check the code below. It will also work when B has multiple ranges.
import pandas as pd
import numpy as np
data = [['A', 'HIGH', 120, 200],
        ['A', 'MID', 350, 200],
        ['A', 'MID', 130, 200],
        ['B', 'HIGH', 130, 100],
        ['B', 'MID', 70, 100],
        ['B', 'MID', 70, 100]]
df = pd.DataFrame(data, columns=['Category', 'Range', 'Total', 'Avg'])

# build a conditional key: Category+Range for 'A', Category alone otherwise
df[['Total_New', 'Avg_New']] = (
    df.assign(group_col=np.where(df['Category'] == 'A',
                                 df.Category + df.Range,
                                 df.Category))
      .groupby('group_col')[['Total', 'Avg']]   # double brackets select a frame
      .transform('sum')
)
df
Output:

  Category Range  Total  Avg  Total_New  Avg_New
0        A  HIGH    120  200        120      200
1        A   MID    350  200        480      400
2        A   MID    130  200        480      400
3        B  HIGH    130  100        270      300
4        B   MID     70  100        270      300
5        B   MID     70  100        270      300
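Not from the original answer, but a variant worth noting: concatenating strings for the key can collide (e.g. 'AB' + 'C' equals 'A' + 'BC'), so an alternative is to group on two keys and blank out Range where the category is not 'A':

# keep Range only where Category is 'A'; every other row shares the key ''
key = df['Range'].where(df['Category'] == 'A', '')
df[['Total_New', 'Avg_New']] = (
    df.groupby([df['Category'], key])[['Total', 'Avg']].transform('sum')
)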

Python pandas dataframe to list by column instead of row

I would like to know if there is an easy way to convert a pandas DataFrame to a list by column instead of by row. For the example below, can we get [['Apple','Orange','Kiwi','Mango'],[220,200,1000,800],['a','o','k','m']]?
I'd appreciate any advice. Thanks.
import pandas as pd

data = {'Brand': ['Apple', 'Orange', 'Kiwi', 'Mango'],
        'Price': [220, 200, 1000, 800],
        'Type': ['a', 'o', 'k', 'm']}
df = pd.DataFrame(data, columns=['Brand', 'Price', 'Type'])
df.head()

df.values.tolist()
# [['Apple', 220, 'a'], ['Orange', 200, 'o'], ['Kiwi', 1000, 'k'], ['Mango', 800, 'm']]
# Any way to get
# [['Apple','Orange','Kiwi','Mango'],[220,200,1000,800],['a','o','k','m']] ?
Just use the T (transpose) attribute:
lst = df.T.values.tolist()
or the transpose() method:
lst = df.transpose().values.tolist()
If you print lst you will get:
[['Apple', 'Orange', 'Kiwi', 'Mango'], [220, 200, 1000, 800], ['a', 'o', 'k', 'm']]
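If you'd rather not transpose, a list comprehension over the columns gives the same result (a trivial alternative, not from the original answer):

lst = [df[col].tolist() for col in df.columns]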

Reshaping dataframe with multiple columns to row groups

Input DataFrame:
df = pd.DataFrame({'Loc': ['Hyd', 'Hyd', 'Bang', 'Bang'],
                   'Item': ['A', 'B', 'A', 'B'],
                   'Month': ['May', 'May', 'June', 'June'],
                   'Sales': [100, 100, 200, 200],
                   'Values': [1000, 1000, 2000, 2000]})
My expected output:
df = pd.DataFrame({'Loc': ['Hyd', 'Hyd', 'Hyd', 'Hyd', 'Bang', 'Bang', 'Bang', 'Bang'],
                   'Item': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
                   'VAR': ['Sales', 'Values', 'Sales', 'Values', 'Sales', 'Values', 'Sales', 'Values'],
                   'May': [100, 1000, 100, 1000, 100, 1000, 100, 1000],
                   'June': [200, 2000, 200, 2000, 200, 2000, 200, 2000]})
I have tried multiple solutions using melt and pivot, but nothing seems to work; I'm not sure what I am missing.
Here's my attempt (written against my real data, which uses the columns Part, IBU and Date1):
dem.melt(['Part', 'IBU', 'Date1']).pivot_table(index=['Part', 'IBU', 'variable'], columns=['Date1'])
Any help would be much appreciated.
You can use the melt and pivot_table functions in pandas:
df_melted = pd.melt(df, id_vars=["Loc", "Item", "Month"], value_vars=["Sales", "Values"])
This will result in:

    Loc Item Month variable  value
0   Hyd    A   May    Sales    100
1   Hyd    B   May    Sales    100
2  Bang    A  June    Sales    200
3  Bang    B  June    Sales    200
4   Hyd    A   May   Values   1000
5   Hyd    B   May   Values   1000
6  Bang    A  June   Values   2000
7  Bang    B  June   Values   2000
And then:
df_pivot = df_melted.pivot_table(index=["Loc", "Item", "variable"], columns="Month")
So, the final output will be:

                    value
Month                June     May
Loc  Item variable
Bang A    Sales     200.0     NaN
          Values   2000.0     NaN
     B    Sales     200.0     NaN
          Values   2000.0     NaN
Hyd  A    Sales       NaN   100.0
          Values      NaN  1000.0
     B    Sales       NaN   100.0
          Values      NaN  1000.0
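For what it's worth, the same reshape can be done in one chain with stack/unstack instead of melt + pivot_table; a sketch assuming the example's column names:

out = (df.set_index(['Loc', 'Item', 'Month'])   # Sales/Values remain as columns
         .rename_axis(columns='VAR')            # name the column axis 'VAR'
         .stack()                               # move Sales/Values into the index
         .unstack('Month')                      # pivot Month back out to columns
         .reset_index())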

Need help in Python Pivot table group by

I have a dataframe with a structure something like the one below:
I need to make it look like this:
Can anyone help, please?
You can use the groupby() function with a list of keys and chain summarising functions with agg().
import pandas as pd

df = pd.DataFrame({'customer': [1, 2, 1, 3, 1, 2, 3],
                   'group_code': ['111', '111', '222', '111', '111', '111', '333'],
                   'ind_code': ['A', 'B', 'AA', 'A', 'AAA', 'C', 'BBB'],
                   'amount': [100, 200, 140, 400, 225, 125, 600],
                   'card': ['XXX', 'YYY', 'YYY', 'XXX', 'XXX', 'YYY', 'XXX']})
# select the numeric column first; recent pandas raises when asked for the
# mean of a string column such as 'card'
df_groupby = df.groupby(['customer', 'group_code', 'ind_code'])['amount'].agg(['count', 'mean'])
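Since the title mentions pivot tables: the same summary can also be written with pivot_table, on the assumption that the count and mean of amount are the statistics wanted:

df_pivot = pd.pivot_table(df,
                          index=['customer', 'group_code', 'ind_code'],
                          values='amount',
                          aggfunc=['count', 'mean'])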

python pandas converting dataframe(s) to lists

I have an Excel file with two data frames: one for a score card and the other for a consolidation basis.
import pandas as pd
df_scr_crd = {'Subject': ['MATH', 'MATH', 'MATH', 'MATH', 'PSY', 'PSY', 'PSY', 'PSY'],
              'SCR_STRT': [10, 20, 30, 99999, 'A', 'B', 'C', 'D'],
              'POINTS': [100, 200, 300, 500, 10, 20, 30, 40]}
df_scr_crd_d = pd.DataFrame(df_scr_crd, columns=['Subject', 'SCR_STRT', 'POINTS'])

df_scr_cns = {'Subject': ['MATH', 'PSY'],
              'CNS': ['min', 'max']}
df_scr_cns_d = pd.DataFrame(df_scr_cns, columns=['Subject', 'CNS'])
df_scr_crd_d
df_scr_cns_d
I want to generate lists/variable assignments from these data frames.
The expected output is:
MATH_df_scr_crd_bin = [10, 20, 30, 99999]
MATH_df_scr_crd_val = [100, 200, 300, 500]
PSY_df_scr_crd_bin = ['A', 'B', 'C', 'D']
PSY_df_scr_crd_val = [10, 20, 30, 40]
MATH_df_scr_cns = 'min'
PSY_df_scr_cns = 'max'
Is there an easy way to convert a data frame to lists?
Thanks in advance,
Vittal
You can simply use .tolist() on the relevant series, e.g.:
>>> df_scr_crd_d.loc[df_scr_crd_d.Subject == 'MATH', 'SCR_STRT'].tolist()
[10, 20, 30, 99999]
>>> df_scr_crd_d.loc[df_scr_crd_d.Subject == 'MATH', 'POINTS'].tolist()
[100, 200, 300, 500]
For the whole dataframe, you can convert it to a dictionary keyed on the column names as follows:
>>> df_scr_crd_d.to_dict('list')
{'POINTS': [100, 200, 300, 500, 10, 20, 30, 40],
'SCR_STRT': [10, 20, 30, 99999, 'A', 'B', 'C', 'D'],
'Subject': ['MATH', 'MATH', 'MATH', 'MATH', 'PSY', 'PSY', 'PSY', 'PSY']}
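To build all of the expected assignments in one pass rather than one .tolist() call per subject, a sketch that collects them in a plain dict (the key names follow the question's naming scheme, not any pandas convention):

out = {}
for subj, grp in df_scr_crd_d.groupby('Subject'):
    out[subj + '_df_scr_crd_bin'] = grp['SCR_STRT'].tolist()
    out[subj + '_df_scr_crd_val'] = grp['POINTS'].tolist()
for subj, cns in zip(df_scr_cns_d['Subject'], df_scr_cns_d['CNS']):
    out[subj + '_df_scr_cns'] = cns
# out['MATH_df_scr_crd_bin'] -> [10, 20, 30, 99999]
# out['PSY_df_scr_cns']      -> 'max'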
