How do I write my own complex fuzzy match with pandas?

I have two dataframes: one with an account number, a purchase ID, a total cost, and a date, and another with an account number, money paid, and a date.
To be clear, there are only two accounts, 11111 and 33333, but the dataframes contain some typos:
AccountNumber  Purchase ID  Total Cost  Date
11111          100          10          1/1/2020
333333         100          10          1/1/2020
33333          200          20          2/2/2020
11111          300          30          4/2/2020

AccountNumber  Date      Money Paid
11111          1/1/2020  5
111111         1/2/2020  2
33333          1/2/2020  1
33333          2/3/2020  15
1111           4/2/2020  30
Each Purchase ID is an identifier for a single purchase, however multiple accounts may be involved within the purchase, such as account 11111 and 33333. Moreover, an account may be used for two different purchases such as account 11111 with Purchase ID 100 and 300. In the second dataframe, payments can be made in increments, so I need to use the date to make sure that the payment is associated with the correct Purchase ID. Moreover, there may be some slight errors in the account numbers so I need to use a fuzzy match. In the end, I want to get a dataframe that is grouped by Purchase ID and compares how much the accounts paid vs. the cost of the item:
Purchase ID  Date      Total Cost  Amount Paid  $Owed
100          1/1/2020  10          8            2
200          2/2/2020  20          15           5
300          4/2/2020  30          30           0
As you can see, this is a fairly complicated question. I first tried joining the two dataframes on AccountNumber, but I ran into two problems. First, the slight differences in the account numbers prevent an exact match. Second, each payment has to be matched to the correct Purchase ID by date: since an account can be involved in several purchases, a plain merge can accidentally sum money paid toward the wrong purchase.
I'm thinking about iterating through the rows and using if statements/regex, but I feel like that would take too long.
What's the simplest and most efficient solution to this problem? I'm a beginner at pandas and Python.

The pandas-dedupe library can help you link the two dataframes using a combination of active learning and clustering. Have a look at the repo.
Here is some sample code with a step-by-step explanation:
import pandas as pd
import pandas_dedupe
#load dataframes
dfa = pd.read_csv('file_a.csv')
dfb = pd.read_csv('file_b.csv')
#initiate matching
df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])
# At this point pandas_dedupe will ask you to label a sample of records according
# to whether they are distinct or the same observation.
# After that, pandas-dedupe uses its knowledge to cluster together similar records.
#send output to csv
df_final.to_csv('linkage_output.csv')
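
If interactive labeling is more than the problem needs, the matching can also be sketched in plain pandas. This is only a sketch under stated assumptions, not the method from the answer above: it assumes the list of valid account numbers is known in advance (KNOWN_ACCOUNTS), that a typo is close enough to be caught by difflib's similarity ratio with a cutoff of 0.8, and that a payment belongs to the most recent purchase made by the same account on or before the payment date. The helper name canonical_account is hypothetical.

```python
import difflib
import pandas as pd

# Sample data from the question (dates are month/day/year).
purchases = pd.DataFrame({
    "AccountNumber": ["11111", "333333", "33333", "11111"],
    "Purchase ID": [100, 100, 200, 300],
    "Total Cost": [10, 10, 20, 30],
    "Date": pd.to_datetime(["1/1/2020", "1/1/2020", "2/2/2020", "4/2/2020"],
                           format="%m/%d/%Y"),
})
payments = pd.DataFrame({
    "AccountNumber": ["11111", "111111", "33333", "33333", "1111"],
    "Date": pd.to_datetime(["1/1/2020", "1/2/2020", "1/2/2020", "2/3/2020", "4/2/2020"],
                           format="%m/%d/%Y"),
    "Money Paid": [5, 2, 1, 15, 30],
})

# Assumption: the set of valid account numbers is known.
KNOWN_ACCOUNTS = ["11111", "33333"]

def canonical_account(raw, cutoff=0.8):
    """Map a possibly mistyped account number to the closest known one."""
    matches = difflib.get_close_matches(str(raw), KNOWN_ACCOUNTS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

purchases["Account"] = purchases["AccountNumber"].map(canonical_account)
payments["Account"] = payments["AccountNumber"].map(canonical_account)

# Attach each payment to the latest purchase by the same account
# dated on or before the payment (both frames must be sorted by date).
purchase_keys = (purchases[["Account", "Date", "Purchase ID"]]
                 .rename(columns={"Date": "Purchase Date"})
                 .sort_values("Purchase Date"))
matched = pd.merge_asof(
    payments.sort_values("Date"), purchase_keys,
    left_on="Date", right_on="Purchase Date",
    by="Account", direction="backward",
)

# Total paid per purchase, compared with the cost of the purchase.
paid = matched.groupby("Purchase ID")["Money Paid"].sum().rename("Amount Paid")
summary = (purchases.groupby("Purchase ID")
           .agg({"Date": "first", "Total Cost": "first"})
           .join(paid)
           .fillna({"Amount Paid": 0})
           .reset_index())
summary["$Owed"] = summary["Total Cost"] - summary["Amount Paid"]
print(summary)
```

On the sample data this reproduces the amounts in the desired output (8, 15, and 30 paid). With real data, the cutoff and the date rule would need checking, since two genuinely different accounts can have similar numbers.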

Related

Python pandas sum based on unique id

I have a dataframe in Pandas which has data like the below:
charge id  charge  payment  adjustment  type
123        $45     $30      0           P
123        $45     0        $10         A
124        $50     $20      0           P
I then use the pivot table to aggregate the data to produce sums / counts of various fields. However summing the charge is incorrect as the same charge can be in the data multiple times. Is it possible to sum the charge once based on the charge id field? So in this example charge id 123 would only be included once in the total.
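
One possible approach, sketched here under the assumption that the amounts have already been converted to numbers (dollar signs stripped) and that every row for a given charge id carries the same charge, is to drop the repeated charge ids before summing the charge, while summing the other columns over all rows:

```python
import pandas as pd

# Sample data from the question, with amounts as plain numbers.
df = pd.DataFrame({
    "charge id": [123, 123, 124],
    "charge": [45, 45, 50],
    "payment": [30, 0, 20],
    "adjustment": [0, 10, 0],
    "type": ["P", "A", "P"],
})

# Count each charge once: keep the first row per charge id, then sum.
total_charge = df.drop_duplicates(subset="charge id")["charge"].sum()

# Payments and adjustments are summed over every row as usual.
total_payment = df["payment"].sum()
total_adjustment = df["adjustment"].sum()

print(total_charge, total_payment, total_adjustment)
```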

Duplicate Analysis Using Python

I am a beginner in Python.
So far I have identified the duplicates using pandas lib but don't know how this will help me.
import pandas as pd
import numpy as np
dataframe = pd.read_csv("HKTW_INDIA_Duplicate_Account.csv")
dataframe.info()
name = dataframe["PARTY_NAME"]
duplicate_data = dataframe[name.isin(name[name.duplicated()])].sort_values("PARTY_NAME")
duplicate_data.head()
What I want: I have a set of data that contains duplicates. I need to merge the duplicates based on certain conditions and populate the feedback in a new column.
I could do this manually in Excel, but the number of records is very high (more than 400,000 rows), so it would take a lot of time.
Primary Account ID Secondary Account ID Account Name Translated Name Created on Date Amount Today Amount Total Split Remarks New ID
1234 245 Julia Julia 24-May-20 530 45 N
2345 Julia Julia 24-Sep-20 N
3456 42 Sara Sara 24-Aug-20 230 Y
4567 Sara Sara 24-Sep-20 Y
5678 Matt Matt 24-Jun-20 N
6789 Matt Matt 24-Sep-20 N
7890 58 Robert Robert 24-Feb-20 525 21 N
1937 Robert Robert 24-Sep-20 N
7854 55 Robert Robert 24-Jan-20 543 74 N
Conditions:
Accounts can be merged only where the Split column is "N" and Amount_Total and Amount_Today are blank.
Expected Output:
Value in Secondary_Account_ID or not.
Example: Row 2 does not have a Secondary_Account_ID and has no value in Amount_Total or Amount_Today, but Row 1 has a value in Secondary_Account_ID. In this case, Row 2 can be merged into Row 1 because both have the same name. The Remarks column should say "Winner account has secondary ID" for both rows, and the Account ID from Row 1 should be copied into the "New ID" column of both rows.
Expected Output:
If duplicate accounts have Amount_Total and Amount_Today, they should not be merged.
Expected Output:
If duplicate accounts do not have any value in Secondary_Account_ID, check the Amount_Today and Amount_Total columns: if one account has values in these two columns, the account without values there is merged into it.
Expected Output:
If more than one duplicate account has a Secondary ID and Amount_Today or Amount_Total is available for only one account, that account is considered the winner account.
Expected Output:
If more than one duplicate account has a Secondary ID and Amount_Today or Amount_Total is available for more than one account, the account with the maximum Amount_Total is considered the winner account.
Expected Output:
If Secondary_Account_ID, Amount_Total, and Amount_Today are all blank, the oldest account should be considered.
Expected Output:

How is DIFF calculated on customer demographics in featuretools?

I have two tables: one with customer information and one with transaction info.
Customer information includes each person's quality of health (from 0 to 100)
e.g. if I extract just the Name and HealthQuality columns:
John: 70
Mary: 20
Paul: 40
etc etc.
After applying featuretools I noticed a new DIFF(HealthQuality) variable.
According to the docs, this is what DIFF does:
"Compute the difference between the value in a list and the previous value in that list."
Is featuretools calculating the difference between Mary and John's health quality in this instance?
I don't think this kind of feature synthesis really works for customer records e.g. CUM_SUM(emails_sent) for John. John's record is one row, and he has one value for the amount of emails we sent him.
For now I am using the ignore_variables=[all_customer_info] option to remove all of the customer data except for transactions table of course.
This also leads me into another question.
Using data from the transactions table, John now has a DIFF(MEAN(transactions.amount)). What is the DIFF measured in this instance?
id MEAN(transactions.amount) DIFF(MEAN(transactions.amount))
0 1 21.950000 NaN
1 2 20.000000 -1.950000
2 3 35.604581 15.604581
3 4 NaN NaN
4 5 22.782682 NaN
5 6 35.616306 12.833624
6 7 24.560536 -11.055771
7 8 331.316552 306.756016
8 9 60.565852 -270.750700
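
The numbers in the table above are consistent with a plain row-by-row difference taken in id order: each DIFF value equals that row's mean minus the previous row's mean, and it is NaN when either of the two is missing. The sketch below checks this with pandas Series.diff, used here only as a stand-in to reproduce the column, not as a statement about how featuretools implements DIFF internally.

```python
import numpy as np
import pandas as pd

# MEAN(transactions.amount) per customer id, copied from the table above.
mean_amount = pd.Series(
    [21.95, 20.0, 35.604581, np.nan, 22.782682,
     35.616306, 24.560536, 331.316552, 60.565852],
    index=pd.Index(range(1, 10), name="id"),
    name="MEAN(transactions.amount)",
)

# Difference between each row and the row immediately before it.
diff = mean_amount.diff().rename("DIFF(MEAN(transactions.amount))")

print(pd.concat([mean_amount, diff], axis=1))
```

Read this way, the value for customer 2 is simply customer 2's mean minus customer 1's mean, so it depends on row order rather than on anything about the customer.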

Creating a dictionary of categoricals in SQL and aggregating them in Python

I have a rather "cross-platform" question. I hope it is not too general.
One of my tables, say customers, consists of my customer IDs and their associated demographic information. Another table, say transaction, contains all purchases made by the customers in the respective shops.
I am interested in analyzing basket compositions together with demographics in Python. Hence, I would like my dataframe to have the shops as columns, holding the summed amount each customer spent at each shop.
For clarity,
select *
from customer
where id=1 or id=2
gives me
id age gender
1 35 MALE
2 57 FEMALE
and
select *
from transaction
where id=1 or id=2
gives me
customer_id shop amount
1 2 250
1 2 500
2 3 100
2 7 200
2 11 125
Which should end up in a (preferably) Pandas dataframe as
id age gender shop_2 shop_3 shop_7 shop_11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
so that the last columns hold the aggregated baskets of the customers.
I have tried to create a python dictionary of the purchases and amounts for each customer in SQL in the following way:
select customer_id, array_agg(concat(cast(shop as varchar), ' : ', cast(amount as varchar))) as basket
from transaction
group by customer_id
Resulting in
id basket
1 ['2 : 250', '2 : 500']
2 ['3 : 100', '7 : 200', '11 : 125']
which could easily be joined on the customer table.
However, this solution is not optimal because the entries inside the [] are strings rather than integers. Getting them into the format I want therefore involves a lot of manipulation and looping in Python.
Is there a way to aggregate the purchases in SQL that makes it easier for Python to read them and aggregate them into columns?

One simple solution would be to do the aggregation in pandas using pivot_table on the second dataframe and then merge with the first:
df2 = df2.pivot_table(columns='shop', values='amount', index='customer_id', aggfunc='sum', fill_value=0.0).reset_index()
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')
Resulting dataframe:
id age gender 2 3 7 11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
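
For reference, here is a self-contained sketch of the same approach on the sample rows from the question. The dataframe names customers and transactions are placeholders for the two query results, and the add_prefix step is an addition that produces the shop_2, shop_3, ... column names the question asked for.

```python
import pandas as pd

# Results of the two SQL queries in the question.
customers = pd.DataFrame({
    "id": [1, 2],
    "age": [35, 57],
    "gender": ["MALE", "FEMALE"],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "shop": [2, 2, 3, 7, 11],
    "amount": [250, 500, 100, 200, 125],
})

# One row per customer, one column per shop, summed amounts, 0 where no purchase.
baskets = (transactions
           .pivot_table(index="customer_id", columns="shop", values="amount",
                        aggfunc="sum", fill_value=0)
           .add_prefix("shop_")
           .reset_index())

# Attach the demographic columns and drop the duplicate key.
result = (customers
          .merge(baskets, left_on="id", right_on="customer_id")
          .drop(columns="customer_id"))
print(result)
```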

Creating custom ranks for a variable, grouped by another variable in Python

I have a dataframe like this:
ID Category Purchase
10001 1 7823
10001 2 7932
10002 1 10899
10003 1 79812
10003 2 80980
.....
There are many other IDs, each with a purchase amount for different categories. I want to rank each user within each category by purchase amount into, say, 10 groups (I may want to experiment with 5 or 20 groups as well), so that observations with similar amounts get the same rank. The pandas rank() function assigns as many ranks as there are values in the category, but I want to limit the number of ranks to 10. This would be the equivalent of PROC RANK in SAS.
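
One way this might be approached, offered as a sketch rather than an exact PROC RANK equivalent, is to bin the within-category ranks into equal-sized groups with pd.qcut. The data below is made up for illustration (20 users per category), and the helper name rank_into_groups is hypothetical. Group boundaries may differ slightly from SAS for ties or small groups.

```python
import pandas as pd

# Hypothetical data: 20 users in each of two categories.
df = pd.DataFrame({
    "ID": list(range(10001, 10021)) * 2,
    "Category": [1] * 20 + [2] * 20,
    "Purchase": list(range(100, 2100, 100)) + list(range(5000, 105000, 5000)),
})

def rank_into_groups(purchases, n_groups=10):
    """Split one category's purchases into n_groups equal-sized rank groups (0-based)."""
    # rank(method="first") breaks ties so that the quantile edges stay unique.
    return pd.qcut(purchases.rank(method="first"), q=n_groups, labels=False)

# Rank within each category; pass n_groups=5 or n_groups=20 to experiment.
df["RankGroup"] = df.groupby("Category")["Purchase"].transform(rank_into_groups)

print(df.groupby(["Category", "RankGroup"]).size())
```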
