Finding additional transactions in two Excel files using Python [duplicate]

This question already has answers here:
How to find dropped data after using Pandas merge in python?
(2 answers)
Closed 4 years ago.
I have two Excel CSV files, loaded as below:
import pandas as pd

df1 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink'], 'Count': [1, 0, 2]}
df1 = pd.DataFrame(df1, columns=df1.keys())
df2 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink', 'SC-002_Signinlink'], 'Count': [2, 1, 2, 1]}
df2 = pd.DataFrame(df2, columns=df2.keys())
In df2 I can see that there is one extra transaction, 'SC-002_Signinlink', which is not in df1. Can someone help me find only those extra transactions and print them to a file?
So far I have done the following to merge the transactions...
merged_df = pd.merge(df1, df2, on = 'Transaction_Name', suffixes=('_df1', '_df2'), how='outer')

Use indicator=True in your merge:
import pandas as pd

df1 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink'], 'Count': [1, 0, 2]}
df1 = pd.DataFrame(df1, columns=df1.keys())
df2 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink', 'SC-002_Signinlink'], 'Count': [2, 1, 2, 1]}
df2 = pd.DataFrame(df2, columns=df2.keys())
df = pd.merge(df1, df2, on='Transaction_Name', how='outer', indicator=True)
# We did not merge on Count, so there are two count columns (Count_x & Count_y).
# Create a single Count column as the sum of the two.
df.Count_x = df.Count_x.fillna(0)
df.Count_y = df.Count_y.fillna(0)
print(df.dtypes)
df['Count'] = df.Count_x + df.Count_y
df = df.loc[df._merge != 'both', ['Transaction_Name', 'Count']]
print(df)
# Missing transactions list :
print(df.Transaction_Name.values.tolist())
Output of print(df.dtypes):
Transaction_Name object
Count_x float64
Count_y int64
_merge category
dtype: object
Output of print(df):
Transaction_Name Count
3 SC-002_Signinlink 1.0
Output of print(df.Transaction_Name.values.tolist()):
['SC-002_Signinlink']
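Since the question asks to print the extra transactions to a file, here is a minimal sketch of that last step (the output file name extra_transactions.csv is just an assumption) using the _merge indicator directly:

```python
import pandas as pd

df1 = pd.DataFrame({'Transaction_Name': ['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink'],
                    'Count': [1, 0, 2]})
df2 = pd.DataFrame({'Transaction_Name': ['SC-001_Homepage', 'SC-002_Homepage',
                                         'SC-001_Signinlink', 'SC-002_Signinlink'],
                    'Count': [2, 1, 2, 1]})

merged = pd.merge(df1, df2, on='Transaction_Name', how='outer', indicator=True)
# 'right_only' marks rows present in df2 but absent from df1
extra = merged.loc[merged['_merge'] == 'right_only', ['Transaction_Name', 'Count_y']]
extra = extra.rename(columns={'Count_y': 'Count'})
extra.to_csv('extra_transactions.csv', index=False)
print(extra)
```

With the sample data, extra holds the single row for 'SC-002_Signinlink'.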


pandas merge resulting in duplicated columns

I am using pandas 1.4.3 and Python 3.9.13.
I am creating some identical data frames as follows:
d = {'col1': [1, 2], 'col2': [3, 4]}
df_1 = pd.DataFrame(data=d)
df_2 = pd.DataFrame(data=d)
df_3 = pd.DataFrame(data=d)
df_4 = pd.DataFrame(data=d)
datasets = [df_1, df_2, df_3, df_4]
Now I am trying to merge them all into a single data frame on col1. So, I do the following:
from functools import reduce
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['col1'], how='outer', suffixes=["_x", "_y"]), datasets)
So, I am basically trying to keep all the columns but use suffixes so that they stay unique. However, the issue is that since there are more than two dataframes, this ends up producing duplicated columns:
col1 col2_x col2_y col2_x col2_y
0 1 3 3 3 3
1 2 4 4 4 4
I was wondering what the best way would be to do such a merge while ensuring no columns are dropped and duplicates are preserved properly, with incrementally increasing suffixes...
EDIT
At the moment, I am now doing it with a loop as:
merged = datasets[0]
for i in range(1, len(datasets)):
    merged = pd.merge(merged, datasets[i], how='outer', on=['col1'], suffixes=[None, f"_{str(i)}"])
A slightly cumbersome solution:
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df_1 = pd.DataFrame(data=d)
df_1.attrs['name']='1'
df_2 = pd.DataFrame(data=d)
df_2.attrs['name']='2'
df_3 = pd.DataFrame(data=d)
df_3.attrs['name']='3'
df_4 = pd.DataFrame(data=d)
df_4.attrs['name']='4'
datasets = [df_1, df_2, df_3, df_4]
from functools import reduce
def mrg(left, right):
    return pd.merge(left, right, on=['col1'], how='outer',
                    suffixes=["_" + str(left.attrs.get('name')), "_" + str(right.attrs.get('name'))])
df_merged = reduce(mrg, datasets)
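A variant worth sketching (not from the original answers): give every non-key column a unique per-frame suffix up front, so the merge itself never has to disambiguate anything:

```python
import pandas as pd
from functools import reduce

d = {'col1': [1, 2], 'col2': [3, 4]}
datasets = [pd.DataFrame(data=d) for _ in range(4)]

# Rename each non-key column with a per-frame suffix before merging,
# so pd.merge never needs its suffixes machinery at all.
renamed = [df.rename(columns={c: f"{c}_{i}" for c in df.columns if c != 'col1'})
           for i, df in enumerate(datasets, start=1)]
df_merged = reduce(lambda left, right: pd.merge(left, right, on='col1', how='outer'), renamed)
print(df_merged.columns.tolist())
# → ['col1', 'col2_1', 'col2_2', 'col2_3', 'col2_4']
```

This sidesteps the suffix-collision problem entirely, at the cost of an extra rename pass.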

How to merge two DataFrames using specific conditions in Python Pandas?

I have two Data Frames:
DataFrame 1
df1 = pd.DataFrame()
df1["ID1"] = [np.nan, 1, np.nan, 3]
df1["ID2"] =[np.nan, np.nan , 2, 3]
df1
DataFrame 2
df2 = pd.DataFrame()
df2["ID"] = [1, 2, 3, 4]
df2
And I need to merge these two DataFrames using below conditions:
If in df1 ID1 == ID2, then I can merge df1 with df2 using df1.ID1 = df2.ID or df1.ID2 = df2.ID.
If in df1 ID1 != ID2, then I have to merge df1 with df2 using both of the conditions mentioned in point 1, i.e. df1.ID1 = df2.ID and df1.ID2 = df2.ID.
I have the conditions as above in points 1 and 2; nevertheless, I totally do not know how to write this in Python Pandas. Any suggestions?
If I understood correctly, this will fix your problem:
import numpy as np
import pandas as pd

df1 = pd.DataFrame()
df1["ID1"] = [np.nan, 1, np.nan, 3]
df1["ID2"] = [np.nan, np.nan, 2, 3]
df2 = pd.DataFrame()
df2["ID"] = [1, 2, 3, 4]
if df1["ID1"].equals(df1["ID2"]):
    pass  # do your merging here
else:
    df1["ID1"], df1["ID2"] = df2["ID"], df2["ID"]
df1
output:
ID1 ID2
0 1 1
1 2 2
2 3 3
3 4 4
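The conditions in the question are open to interpretation; one possible reading (purely a sketch, assuming a match on ID1 should win and ID2 is the fallback) is to merge twice and combine the results:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'ID1': [np.nan, 1, np.nan, 3], 'ID2': [np.nan, np.nan, 2, 3]})
df2 = pd.DataFrame({'ID': [1, 2, 3, 4]})

# Match on ID1 where possible, fall back to ID2 otherwise
m1 = df1.merge(df2, left_on='ID1', right_on='ID', how='left')
m2 = df1.merge(df2, left_on='ID2', right_on='ID', how='left')
df1['ID'] = m1['ID'].combine_first(m2['ID'])
print(df1)
```

Rows where both ID1 and ID2 are NaN stay unmatched (ID remains NaN), which may or may not be the desired behaviour.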

Processing a dataframe with another dataframe

I have two data frames, df1 and df2. Both include columns like 'ID', 'Name', 'Score' and 'Status'. I need to update the 'Score' in df1 if that person's status in df2 is "Edit", and drop the row from df1 if that person's status in df2 is "Cancel".
For example:
dic1 = {'ID': [1, 2, 3],
        'Name': ['Jack', 'Tom', 'Annie'],
        'Score': [20, 10, 25],
        'Status': ['New', 'New', 'New']}
dic2 = {'ID': [1, 2],
        'Name': ['Jack', 'Tom'],
        'Score': [28, 10],
        'Status': ['Edit', 'Cancel']}
df1 = pd.DataFrame(dic1)
df2 = pd.DataFrame(dic2)
The output should be like:
ID Name Score Status
1 Jack 28 Edit
3 Annie 25 New
Any pointers or hints?
Use DataFrame.merge with a left join first, then filter out the Cancel rows and also the columns ending with _ that came from the original DataFrame:
df = df1.merge(df2, on=['ID','Name'], how='left', suffixes=('_', ''))
df = df.loc[df['Status'] != 'Cancel', ~df.columns.str.endswith('_')]
print (df)
ID Name Score Status
0 1 Jack 28 Edit
EDIT: Add DataFrame.combine_first to replace the missing rows:
df = df1.merge(df2, on=['ID','Name'], how='left', suffixes=('', '_'))
df = df.loc[df['Status_'] != 'Cancel']
df1 = df.loc[:, df.columns.str.endswith('_')]
df = df1.rename(columns=lambda x: x.rstrip('_')).combine_first(df).drop(df1.columns, axis=1)
print (df)
ID Name Score Status
0 1.0 Jack 28.0 Edit
2 3.0 Annie 25.0 New
Use the pandas.DataFrame.update command of the pandas package:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
df1.update(df2)
print(df1)
df1 = df1[df1.Status != "Cancel"]
print(df1)
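Note that DataFrame.update aligns on the index, so df1.update(df2) only works here because IDs 1 and 2 happen to sit at row positions 0 and 1. A more robust sketch sets ID as the index first:

```python
import pandas as pd

dic1 = {'ID': [1, 2, 3], 'Name': ['Jack', 'Tom', 'Annie'],
        'Score': [20, 10, 25], 'Status': ['New', 'New', 'New']}
dic2 = {'ID': [1, 2], 'Name': ['Jack', 'Tom'],
        'Score': [28, 10], 'Status': ['Edit', 'Cancel']}
df1 = pd.DataFrame(dic1).set_index('ID')
df2 = pd.DataFrame(dic2).set_index('ID')

df1.update(df2)  # align on ID; overwrite Score/Status where df2 has values
df1 = df1[df1.Status != 'Cancel'].reset_index()
print(df1)
```

This leaves Jack with the edited score 28, drops Tom (Cancel), and keeps Annie untouched.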

python pandas for loop assign column based on which dataframe it came from

I am using a for loop to go through two frames to eventually concat them.
data_frames = []
data_frames.append(df1)
data_frames.append(df2)
For data_frame in data_frames:
data_frame['col1'] = 'Test'
if date_frame.name = df1:
data_frame['col2'] = 'Apple'
else:
data_frame['col2'] = 'Orange'
The above fails, but in essence, I want to create data_frame['col2']'s value to be dependent on which dataframe it came from. So if the row is from df1, the value for that column should be 'Apple' and if not it should be 'Orange'
There are quite a few syntax errors in your code, but I believe this is what you're trying to do:
import pandas as pd

# Example dataframes
df1 = pd.DataFrame({
    'a': [1, 1, 1],
})
# With names!
df1.name = 'df1'
df2 = pd.DataFrame({
    'a': [2, 2, 2],
})
df2.name = 'df2'
# Create a list of df1 & df2
data_frames = [df1, df2]
# For each data frame in the list
for data_frame in data_frames:
    # Set col1 to 'Test'
    data_frame['col1'] = 'Test'
    # If the data frame's name is 'df1'
    if data_frame.name == 'df1':
        # Set col2 to 'Apple'
        data_frame['col2'] = 'Apple'
    else:
        # Otherwise set col2 to 'Orange'
        data_frame['col2'] = 'Orange'
# Print the dataframes
for data_frame in data_frames:
    print("{name}:\n{value}\n\n".format(name=data_frame.name, value=data_frame))
Output:
df1:
a col1 col2
0 1 Test Apple
1 1 Test Apple
2 1 Test Apple
df2:
a col1 col2
0 2 Test Orange
1 2 Test Orange
2 2 Test Orange
Let's use pd.concat with keys, using @AaronNBrock's setup:
df1 = pd.DataFrame({
    'a': [1, 1, 1],
})
df2 = pd.DataFrame({
    'a': [2, 2, 2],
})
list_of_dfs = ['df1', 'df2']
df_out = pd.concat([eval(i) for i in list_of_dfs], keys=list_of_dfs)\
           .rename_axis(['Source', None]).reset_index()\
           .drop('level_1', axis=1)
print(df_out)
Output:
Source a
0 df1 1
1 df1 1
2 df1 1
3 df2 2
4 df2 2
5 df2 2
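A variant that avoids eval entirely (just a sketch of the same idea) tags each frame with assign before concatenating:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 1, 1]})
df2 = pd.DataFrame({'a': [2, 2, 2]})

# assign adds the Source column per-frame, so no .name attribute or eval is needed
df_out = pd.concat([df1.assign(Source='df1'), df2.assign(Source='df2')],
                   ignore_index=True)
print(df_out)
```

This keeps the frames as plain objects and works even when they are built dynamically rather than bound to variable names.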

How to merge multilevel (i.e. MultiIndex) dataframes?

What's the Python/pandas way to merge multilevel dataframes on column "t" under "cell 1" and "cell 2"?
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(4).reshape(2, 2),
                   columns=[['cell 1'] * 2, ['t', 'sb']])
df2 = pd.DataFrame([[1, 5], [2, 6]],
                   columns=[['cell 2'] * 2, ['t', 'sb']])
Now when I try to merge on "t", the Python REPL errors out:
ddf = pd.merge(df1, df2, on='t', how='outer')
What's a good way to handle this?
pd.merge(df1, df2, left_on=[('cell 1', 't')], right_on=[('cell 2', 't')])
One solution is to drop the top level (e.g. 'cell 1' and 'cell 2') from the dataframes and then merge.
If you want, you can save these columns to reinstate them after the merge.
c1 = df1.columns
c2 = df2.columns
df1.columns = df1.columns.droplevel()
df2.columns = df2.columns.droplevel()
df_merged = df1.merge(df2, on='t', how='outer', suffixes=['_df1', '_df2'])
df1.columns = c1
df2.columns = c2
>>> df_merged
t sb_df1 sb_df2
0 0 1 NaN
1 2 3 6
2 1 NaN 5
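If the two-level header should survive, one sketch (the top-level label 'key' for the merge column is an invented placeholder) rebuilds the MultiIndex after the flat merge:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(4).reshape(2, 2),
                   columns=[['cell 1'] * 2, ['t', 'sb']])
df2 = pd.DataFrame([[1, 5], [2, 6]],
                   columns=[['cell 2'] * 2, ['t', 'sb']])

# Merge the flattened frames, then rebuild a two-level header by hand
merged = pd.merge(df1.droplevel(0, axis=1), df2.droplevel(0, axis=1),
                  on='t', how='outer', suffixes=['_df1', '_df2'])
merged.columns = pd.MultiIndex.from_tuples(
    [('key', 't'), ('cell 1', 'sb'), ('cell 2', 'sb')])
print(merged)
```

droplevel returns a new frame, so df1 and df2 keep their original MultiIndex columns untouched.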
