Joining two dataframes on matching values - python

I have a dataframe of transactions between people, but it's based on their ID numbers:
df =
First ID  Second ID  Total  Currency
854       938        50     GBP
321       438        30     EUR
756       850        50     USD
etc...
I also have a second df which contains the IDs and the actual names of the people they are linked to.
ID_df =
ID code  Name
321      John
850      David
etc...
I want to join the ID df onto the main dataframe so that I would have the names of the people. Ideally I would like it to look like:
df =
First ID  First name  Second ID  Second name  Total  Currency
854       Steve       938        Mike         60     EUR
etc...

What's the issue with two back-to-back joins? (You can make a function out of it in case it becomes 2+; see the sketch below the code.)
df = df.merge(id_df, how='left', left_on='first_id', right_on='id_code')\
       .rename(columns={'name': 'first_name'})
df = df.merge(id_df, how='left', left_on='second_id', right_on='id_code')\
       .rename(columns={'name': 'second_name'})
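A minimal sketch of that helper, assuming the lowercase column names used in the code above (first_id, second_id, id_code, name); the add_name function is just an illustrative name, and dropping the duplicated id_code key keeps the result tidy:

def add_name(df, id_df, id_col, new_name_col):
    # Left-join the lookup table on the given ID column, rename the looked-up
    # name column, and drop the redundant join key from the result
    out = df.merge(id_df, how='left', left_on=id_col, right_on='id_code')
    return out.rename(columns={'name': new_name_col}).drop(columns='id_code')

df = add_name(df, id_df, 'first_id', 'first_name')
df = add_name(df, id_df, 'second_id', 'second_name')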

Related

Duplicate Analysis Using Python

I am a beginner in Python.
So far I have identified the duplicates using pandas lib but don't know how this will help me.
import pandas as pd
import numpy as np
dataframe = pd.read_csv("HKTW_INDIA_Duplicate_Account.csv")
dataframe.info()
name = dataframe["PARTY_NAME"duplicate_data=dataframe[name.isin(name[name.duplicated()])].sort_values("PARTY_NAME")
duplicate_data.head()
What I want: I have a set of duplicated data and I need to merge the duplicates based on certain conditions, populating the outcome in a new column.
I could also do this manually in Excel, but the number of records is very high (more than 400,000 rows), which would consume a lot of time.
Primary Account ID  Secondary Account ID  Account Name  Translated Name  Created on Date  Amount Today  Amount Total  Split  Remarks  New ID
1234                245                   Julia         Julia            24-May-20        530           45            N
2345                                      Julia         Julia            24-Sep-20                                    N
3456                42                    Sara          Sara             24-Aug-20        230                         Y
4567                                      Sara          Sara             24-Sep-20                                    Y
5678                                      Matt          Matt             24-Jun-20                                    N
6789                                      Matt          Matt             24-Sep-20                                    N
7890                58                    Robert        Robert           24-Feb-20        525           21            N
1937                                      Robert        Robert           24-Sep-20                                    N
7854                55                    Robert        Robert           24-Jan-20        543           74            N
Conditions:
Only those accounts can be merged where the Split column is "N" and both Amount_Total and Amount_Today are blank.
First check whether there is a value in Secondary_Account_ID or not.
Example: Row 2 does not have a Secondary Account ID and has no value in Amount_Total or Amount_Today, but Row 1 has a value in Secondary_Account_ID, so Row 2 can be merged into Row 1 because both have the same name. The Remarks column should say "Winner account has secondary id" for rows 1 and 2, and the Account ID from Row 1 should be copied into the "New ID" column of both rows.
If duplicate accounts have Amount_Total and Amount_Today then they should not be merged.
If duplicate accounts do not have any value in Secondary_Account_ID, then check the Amount_Today and Amount_Total columns; the account that does not have values in these two columns will be merged into the one that does.
If more than one duplicate account has a Secondary ID and Amount_Today or Amount_Total is available for only one of them, then that account is considered the winner account.
If more than one duplicate account has a Secondary ID and Amount_Today or Amount_Total is available for more than one of them, then the account with the maximum value in Amount_Total is considered the winner account.
If Secondary_Account_ID, Amount_Total, and Amount_Today are all blank, then the oldest account should be considered the winner.
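The rules above amount to picking one "winner" row per duplicate name and copying its Primary Account ID into New ID. A rough, assumption-heavy sketch of that ranking (it presumes a dataframe df with the column names from the table above and does not cover every remark or edge case):

import pandas as pd

# Helper flags used only for ranking candidate winner rows
df['has_secondary'] = df['Secondary Account ID'].notna()
df['has_amount'] = df['Amount Today'].notna() | df['Amount Total'].notna()
df['created'] = pd.to_datetime(df['Created on Date'], format='%d-%b-%y')

# Rank the rows of each duplicate name: secondary ID first, then presence of
# amounts, then the largest Amount Total, then the oldest creation date
winners = (df[df['Split'] == 'N']
           .sort_values(['has_secondary', 'has_amount', 'Amount Total', 'created'],
                        ascending=[False, False, False, True])
           .drop_duplicates(subset='Account Name', keep='first'))

# Copy each winner's Primary Account ID into New ID for every row of its group
df['New ID'] = df['Account Name'].map(
    winners.set_index('Account Name')['Primary Account ID'])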

Aggregating rows in a data frame and eliminating duplicates

I want to merge rows in my df so I have one unique row per ID/Name with other values either summed (revenue) or concatenated (subject and product). However, where I am concatenating, I do not want duplicates to appear.
My df is similar to this:
ID   Name   Revenue  Subject  Product
123  John   125      Maths    A
123  John   75       English  B
246  Mary   32       History  B
312  Peter  67       Maths    A
312  Peter  39       Science  A
I am using the following code to aggregate rows in my data frame
def f(x): return ' '.join(list(x))

df.groupby(['ID', 'Name']).agg(
    {'Revenue': 'sum', 'Subject': f, 'Product': f}
)
This results in output like this:
ID   Name   Revenue  Subject        Product
123  John   200      Maths English  A B
246  Mary   32       History        B
312  Peter  106      Maths Science  A A
How can I amend my code so that duplicates are removed in my concatenation? So in the example above the last row reads A in Product and not A A
You are very close. Apply set to the items before joining them; this keeps only the unique values (note that a set does not preserve the original order).
def f(x): return ' '.join(list(set(x)))

df.groupby(['ID', 'Name']).agg(
    {'Revenue': 'sum', 'Subject': f, 'Product': f}
)
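If the original order of the concatenated values matters, a small variant of the same idea keeps the first occurrence of each item instead of relying on set:

def f(x):
    # dict.fromkeys keeps only the first occurrence of each value, in order
    return ' '.join(dict.fromkeys(x))

df.groupby(['ID', 'Name']).agg(
    {'Revenue': 'sum', 'Subject': f, 'Product': f}
)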

Using Pandas to map results of a groupby.sum() to another dataframe?

I have two dataframes: one is at a micro level, containing all line items purchased across all transactions (df1). The other dataframe will be built as a higher-level aggregation that summarizes the revenue generated per transaction, essentially summing up all line items for each transaction (df2).
df1
Out[df1]:
   transaction_id  item_id  amount
0  AJGDO-12304     120      $120
1  AJGDO-12304     40       $10
2  AJGDO-12304     01       $10
3  ODSKF-99130     120      $120
4  ODSKF-99130     44       $30
5  ODSKF-99130     03       $50
df2
Out[df2]
   transaction_id  location_id  customer_id  revenue (THIS WILL BE THE ADDED COLUMN!)
0  AJGDO-12304     2131234      1234         $140
1  ODSKF-99130     213124       1345         $200
How would I go about linking the output of a groupby.sum() to df2? The revenue column will essentially be df1's amount aggregated per transaction_id, and I want to link it to df2['transaction_id'].
Here is what I currently have tried but am struggling with putting together,
results = df1.groupby('transaction_id')['amount'].sum()
df2['revenue'] = df2['transaction_id'].merge(results,how='left').value
Use map:
lookup = df1.groupby(['transaction_id'])['amount'].sum()
df2['revenue'] = df2.transaction_id.map(lookup)
print(df2)
Output
   transaction_id  location_id  customer_id  revenue
0  AJGDO-12304     2131234      1234         140
1  ODSKF-99130     213124       1345         200
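One caveat worth flagging: df1's amount column is shown with a '$' prefix. If those values are stored as strings, summing them will concatenate rather than add, so strip the symbol and convert first (a small sketch, assuming the column really is text):

# Convert '$120'-style strings to numbers before aggregating
df1['amount'] = df1['amount'].str.lstrip('$').astype(float)

lookup = df1.groupby('transaction_id')['amount'].sum()
df2['revenue'] = df2['transaction_id'].map(lookup)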

Search for a matching string between two dataframes, and assign the matching column's name to the other dataframe with a function (Pandas)

"unique_receivers" is a Pandas dataframe with columns for unique transaction receivers, amounts and an empty column for categories which I want to fill with a function.
unique_receivers
     Receiver   Amount  Category
144  SALE       -18.93
141  TACO BELL  -19.20
78   MCDONALDS  -19.65
104  EXPRESS    -20.00
154  SHOP       -24.00
I want to fill the above dataframe's "Category" column based on its "Receiver" column's matches with search terms in another dataframe, "category_searchterms".
"category_searchterms" has categories as column names, and each category's column has its respective search terms.
Here's a sample of that dataframe:
categories
   Groceries  Electricity  Fastfood
0  SHOP       ELCOMPANY    MCDONALDS
1  MARKET     POWER        SUBWAY
2  SALE                    PIZZA
I want to go through every row of the "unique_receivers"'s "Receiver" column, look for a match in the "categories" dataframe, take the matching column's name and assign that to the first dataframe's "Category" column.
I'm trying to do it with this function:
def add_category(searchterm):
    unique_receivers["Category"] = (category_searchterms == searchterm).any().idxmax()
And then call it:
unique_receivers.apply(add_category(unique_receivers["Receiver"]), axis=1)
Problem:
TypeError: ("'NoneType' object is not callable", 'occurred at index 144')
Index 144 is the first row in "unique_receivers". If I now call the dataframe, every row has been filled with the first category:
unique_receivers
     Receiver   Amount  Category
144  SALE       -18.93  Groceries
141  TACO BELL  -19.20  Groceries
78   MCDONALDS  -19.65  Groceries
104  EXPRESS    -20.00  Groceries
154  SHOP       -24.00  Groceries
How can I get the real matching category to appear on each row's "Category" column? Thank you.
Here's a way using apply and a custom lambda function:
unique_receivers['Category'] = unique_receivers.Receiver.apply(
    lambda x: ''.join([i for i in categories.columns
                       if categories.loc[:, i].str.contains(x).any()])
              or None)
     Receiver   Amount  Category
144  SALE       -18.93  Groceries
141  TACOBELL   -19.20  None
78   MCDONALDS  -19.65  Fastfood
104  EXPRESS    -20.00  None
154  SHOP       -24.00  Groceries
Or using pd.melt and right merge with df1:
categories.melt(var_name='Category').merge(unique_receivers,
                                           left_on='value', right_on='Receiver',
                                           how='right')\
          [['Receiver', 'Amount', 'Category']]
   Receiver   Amount  Category
0  SHOP       -24.00  Groceries
1  SALE       -18.93  Groceries
2  MCDONALDS  -19.65  Fastfood
3  TACOBELL   -19.20  None
4  EXPRESS    -20.00  None
Does this work?
import pandas as pd
import numpy as np

unique_receivers['Category'] = unique_receivers['Receiver'].apply(
    lambda x: np.resize(categories.columns.values[np.where(categories.isin([x]))[1]], 1)[0])
The np.resize is there to ensure you don't get an IndexError if no values are found.

Creating a dictionary of categoricals in SQL and aggregating them in Python

I have a rather "cross platformed" question. I hope it is not too general.
One of my tables, say customers, consists of my customer id's and their associated demographic information. Another table, say transaction, contains all purchases from the customers in the respective shops.
I am interested in analyzing basket compositions together with demographics in Python. Hence, I would like to have the shops as columns and, for each customer, the summed amount spent at each shop in my dataframe.
For clarity,
select *
from customer
where id=1 or id=2
gives me
id  age  gender
1   35   MALE
2   57   FEMALE
and
select *
from transaction
where id=1 or id=2
gives me
customer_id  shop  amount
1            2     250
1            2     500
2            3     100
2            7     200
2            11    125
Which should end up in a (preferably) Pandas dataframe as
id  age  gender  shop_2  shop_3  shop_7  shop_11
1   35   MALE    750     0       0       0
2   57   FEMALE  0       100     200     125
Such that the last columns is the aggregated baskets of the customers.
I have tried to create a python dictionary of the purchases and amounts for each customer in SQL in the following way:
select customer_id, array_agg(concat(cast(shop as varchar), ' : ', cast(amount as varchar))) as basket
from transaction
group by customer_id
Resulting in
id  basket
1   ['2 : 250', '2 : 500']
2   ['3 : 100', '7 : 200', '11 : 125']
which could easily be joined on the customer table.
However, this solution is not optimal because the values inside the [] are strings rather than integers. Hence, it involves a lot of manipulation and looping in Python to get it into the format I want.
Is there any way where I can aggregate the purchases in SQL making it easier for python to read and aggregate into columns?
One simple solution would be to do the aggregation in pandas using pivot_table on the second dataframe and then merge with the first:
df2 = df2.pivot_table(columns='shop', values='amount', index='customer_id',
                      aggfunc='sum', fill_value=0.0).reset_index()
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')
Resulting dataframe:
id  age  gender  2    3    7    11
1   35   MALE    750  0    0    0
2   57   FEMALE  0    100  200  125
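If the columns should carry the shop_ prefix from the desired output rather than the bare shop numbers, a small follow-up sketch (same frames as above) renames the pivoted columns before merging:

df2 = df2.pivot_table(columns='shop', values='amount', index='customer_id',
                      aggfunc='sum', fill_value=0.0).reset_index()
# Prefix every pivoted shop column; leave the join key untouched
df2 = df2.rename(columns={c: f'shop_{c}' for c in df2.columns if c != 'customer_id'})
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')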
