Duplicate Analysis Using Python

Duplicate Analysis Using Python - python

I am a beginner in Python.
So far I have identified the duplicates using pandas lib but don't know how this will help me.
import pandas as pd
import numpy as np
dataframe = pd.read_csv("HKTW_INDIA_Duplicate_Account.csv")
dataframe.info()
name = dataframe["PARTY_NAME"duplicate_data=dataframe[name.isin(name[name.duplicated()])].sort_values("PARTY_NAME")
duplicate_data.head()
What I want: I have a set of data that is duplicated and I need to merge the duplicates based on certain conditions and need to populate the feedback in a new column.
I can do this manually also in Excel but the records are very high which will consume a lot of time. (More than 4,00,000 rows)
Primary Account ID Secondary Account ID Account Name Translated Name Created on Date Amount Today Amount Total Split Reamrks New ID
1234 245 Julia Julia 24-May-20 530 45 N
2345 Julia Julia 24-Sep-20 N
3456 42 Sara Sara 24-Aug-20 230 Y
4567 Sara Sara 24-Sep-20 Y
5678 Matt Matt 24-Jun-20 N
6789 Matt Matt 24-Sep-20 N
7890 58 Robert Robert 24-Feb-20 525 21 N
1937 Robert Robert 24-Sep-20 N
7854 55 Robert Robert 24-Jan-20 543 74 N
Conditions:
Only those accounts can be merged where we have "N" in Split Column and Amount_Total & Amount_Today is Blank.
Expected Output:
Value in Secondary_Account_ID or not.
Example: Row 2 does not have any Secondary Registry ID and does not have any value in Amount_Total & Amount_Todat but Row 1 has the value in Secondary_Account_ID, so in this case, Row 2 can be merged to Row 1 because both have the same name. In the remarks columns, it should give me Winner account have secondary id(row 2 & row 1) and copy the Account ID from row 1 and paste in (row 2 & row 1) (Column "New ID")
Expected Output:
If duplicate accounts have Amount_Total and Amount_Today then it should not be merged.
Expected Output:
If duplicate accounts do not have any value in Secondary_Account_ID then it will check for Amount_today or Amount_total column, if the value is there in these two columns then the account which does not have values in these two columns will be merged to another one.
Expected Output:
If more the one duplicate account has a Secondary ID and if Amount_Today or Amount_Total is available for one account then that account will be considered as a winner account.
Expected Output:
If more the one duplicate account has a Secondary ID and if Amount_Today or Amount_Total is available for more than one account then that account which has the maximum value in Amount_Total will be considered as winner account.
Expected Output:
If Secondary_Account_ID, Total_Amount, and Today_Amount is blank then it should consider the oldest account.
Expected Output:

Related

Compare different df's row by row and return changes

Every month I collect data that contains details of employees to be stored in our database.
I need to find a solution to compare the data stored in the previous month to the data received and, for each row that any of the columns had a change, it would return into a new dataframe.
I would also need to know somehow which columns in each row of this new returned dataframe had a change when this comparison happened.
There are also some important details to mention:
Each column can also contain blank values in any of the dataframes;
The dataframes have the same column names but not necessarily the same data type;
The dataframes do not have the same number of rows necessarily;
If a row do not find its Index match, do not return to the new dataframe;
The rows of the dataframes can be matched by a column named "Index"
So, for example, we would have this dataframe (which is just a slice of the real one as it has 63 columns):
df1:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com
3 MKT $7600 Maria d 30-06-2021
4 I'T 8000 Peter az#i.com 14-07-2021
df2:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
5 IT 9000 John NOT PROVIDED
6 IT 9900 John NOT PROVIDED
df3:
Index Department Salary Manager Email Start_Date
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
**The differences in this example are:
Start date added in row of Index 2
Salary format corrected and email corrected for row Index 3
Department format corrected for row Index 4
What would be the best way to to this comparison?
I am not sure if there is an easy solution to understand what changed in each field but returning the dataframe with rows that had at least 1 change would be helpful.
Thank you for the support!

I think compare could do the trick: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
But first you would need to align the rows between old and new dataframe via the index:
new_df_to_compare=new_df.loc[old_df.index]
When datatypes don't match. You would also need to align them:
new_df_to_compare = new_df_to_compare.astype(old_df.dtypes.to_dict())
Then compare should work just like this:
difference_df = old_df.compare(new_df_to_compare)

How to drop rows in one DataFrame based on one similar column in another Dataframe that has a different number of rows

I have two DataFrames that are completely dissimilar except for certain values in one particular column:
df
First Last Email Age
0 Adam Smith email1#email.com 30
1 John Brown email2#email.com 35
2 Joe Max email3#email.com 40
3 Will Bill email4#email.com 25
4 Johnny Jacks email5#email.com 50
df2
ID Location Contact
0 5435 Austin email5#email.com
1 4234 Atlanta email1#email.com
2 7896 Toronto email3#email.com
How would I go about finding the matching values in the Email column of df and the Contact column of df2, and then dropping the whole row in df based on that match?
Output I'm looking for (index numbering doesn't matter):
df1
First Last Email Age
1 John Brown email2#email.com 35
3 Will Bill email4#email.com 25
I've been able to identify matches using a few different methods like:
Changing the column names to be identical
common = df.merge(df2,on=['Email'])
df3 = df[(~df['Email'].isin(common['Email']))]
But df3 still shows all the rows from df.
I've also tried:
common = df['Email'].isin(df2['Contact'])
df.drop(df[common].index, inplace = True)
And again, identifies the matches but df still contains all original rows.
So the main thing I'm having difficulty with is updating df with the matches dropped or creating a new DataFrame that contains only the rows with dissimilar values when comparing the Email column from df and the Contact column in df2. Appreciate any suggestions.

As mentioned in the comments(#Arkadiusz), it is enough to filter your data using the following
df3 = df[(~df['Email'].isin(df2.Contact))].copy()
print(df3)

How do I write my own complex fuzzy match with pandas?

I have two dataframes: one with an account number, a purchase ID, a total cost, and a date
and another with account number, money paid, and date:
To make it clear there are two accounts, 11111 and 33333, but there are some typos in the dataframes.
AccountNumber Purchase ID Total Cost Date
11111 100 10 1/1/2020
333333 100 10 1/1/2020
33333 200 20 2/2/2020
11111 300 30 4/2/2020
AccountNumber Date Money Paid:
11111 1/1/2020 5
111111 1/2/2020 2
33333 1/2/2020 1
33333 2/3/2020 15
1111 4/2/2020 30
Each Purchase ID is an identifier for a single purchase, however multiple accounts may be involved within the purchase, such as account 11111 and 33333. Moreover, an account may be used for two different purchases such as account 11111 with Purchase ID 100 and 300. In the second dataframe, payments can be made in increments, so I need to use the date to make sure that the payment is associated with the correct Purchase ID. Moreover, there may be some slight errors in the account numbers so I need to use a fuzzy match. In the end, I want to get a dataframe that is grouped by Purchase ID and compares how much the accounts paid vs. the cost of the item:
Purchase ID Date Total Cost Amount Paid $Owed
100 1/1/2020 10 8 2
200 2/2/2020 20 15 5
300 4/2/2020 30 30 0
As you can see, this is a fairly complicated question. I first tried just joining the two dataframes based on AccountNumber but I ran into issues due to the slight differences as well as the problem of matching the Accountnumber transaction to the correct Purchase ID with the date, because one error with merging is that you might accidentally sum up money paid for the wrong Purchase since accounts can be involved with different purchases.
I'm thinking about iterating through the rows and using if statements/regex but I feel like that would take too long.
What's the simplest and efficient solution to this problem? I'm a beginner at pandas and python.

The library pandas-dedupe can help you to link the two dataframe by using a combination of active learning and clustering. have a look at the repo.
Here is the sample code (and step by step explanation):
import pandas as pd
import pandas_dedupe
#load dataframes
dfa = pd.read_csv('file_a.csv')
dfb = pd.read_csv('file_b.csv')
#initiate matching
df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])
# At this point pandas_dedupe will ask you to label a sample of records according
# to whether they are distinct or the same observation.
# After that, pandas-dedupe uses its knowledge to cluster together similar records.
#send output to csv
df_final.to_csv('linkage_output.csv')

Removing the rows from dataframe till the actual column names are found

I am reading tabular data from the email in the pandas dataframe.
There is no guarantee that column names will contain in the first row.Sometimes data is in the following format.The actual column names are [ID,Name and Year]
dummy1 dummy2 dummy3
test_column1 test_column2 test_column3
ID Name Year
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Sometimes the column names come in the first row as expected.
ID Name Year
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Once I read the HTML table from the email,how I remove the initial rows that don't contain the column names?So in the first case I would need to remove first 2 rows in the dataframe(including column row) and in the second case,i wouldn't have to remove anything.
Also,the column names can be in any sequence.
basically,I want to do in following
1.check whether once of the column names contains in one of the rows in dataframe
2.Remove the rows above
if "ID" in row:
remove the above rows
How can I achieve this?

You can first get index of valid columns and then filter and set accordingly.
df = pd.read_csv("d.csv",sep='\s+', header=None)
col_index = df.index[(df == ["ID","Name","Year"]).all(1)].item() # get columns index
df.columns = df.iloc[col_index].to_numpy() # set valid columns
df = df.iloc[col_index + 1 :] # filter data
df
ID Name Year
3 1 John Sophomore
4 2 Lisa Junior
5 3 Ed Senior
or
If you want to se ID as index
df = df.iloc[col_index + 1 :].set_index('ID')
df
Name Year
ID
1 John Sophomore
2 Lisa Junior
3 Ed Senior

Ugly but effective quick try:
id_name = df.columns[0]
df_clean = df[(df[id_name] == 'ID') | (df[id_name].dtype == 'int64')]

How to aggregate rows in a pandas dataframe

I have a dataframe shown in the image 1. It is a sample of pubs in London,UK (3337 pubs/rows). And the geometry is at an LSOA level. In some LSOAs, there is more than 1 pub. I want my dataframe to summarise the number of pubs in every LSOA. I already have the information by using
psdf['lsoa11nm'].value_counts()
prints out:
City of London 001F 103
City of London 001G 40
Westminster 013B 36
Westminster 018A 36
Westminster 013E 30
...
Lambeth 005A 1
Croydon 043C 1
Hackney 002E 1
Merton 022D 1
Bexley 008B 1
Name: lsoa11nm, Length: 1630, dtype: int64
I cant use this as a new dataframe because it is a key and one column as opposed two columns where one would be lsoa11nm and the other pub count.
Does anyone know how to groupby the dataframe so that there will be only one row for every lsoa, that says how many pubs are in it?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.