Using Pandas to map results of a groupby.sum() to another dataframe? - python

I have two dataframes. One is at a micro level, containing all line items purchased across all transactions (DF1). The other dataframe will be built, with the intention of being a higher-level aggregation that summarizes the revenue generated per transaction, essentially summing up all line items for each transaction (DF2).
df1
Out[df1]:
transaction_id item_id amount
0 AJGDO-12304 120 $120
1 AJGDO-12304 40 $10
2 AJGDO-12304 01 $10
3 ODSKF-99130 120 $120
4 ODSKF-99130 44 $30
5 ODSKF-99130 03 $50
df2
Out[df2]
transaction_id location_id customer_id revenue(THIS WILL BE THE ADDED COLUMN!)
0 AJGDO-12304 2131234 1234 $140
1 ODSKF-99130 213124 1345 $200
How would I go about linking the output of a groupby.sum() and assigning it to df2? The revenue column will essentially be the revenue of df1 aggregated by transaction_id, and I want to link it to df2['transaction_id'].
Here is what I have tried so far, but I am struggling to put it together:
results = df1.groupby('transaction_id')['amount'].sum()
df2['revenue'] = df2['transaction_id'].merge(results,how='left').value

Use map:
lookup = df1.groupby(['transaction_id'])['amount'].sum()
df2['revenue'] = df2.transaction_id.map(lookup)
print(df2)
Output
transaction_id location_id customer_id revenue
0 AJGDO-12304 2131234 1234 140
1 ODSKF-99130 213124 1345 200
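For reference, the merge approach attempted in the question can also work; a minimal sketch, assuming df1['amount'] has been converted to a numeric type (e.g. by stripping the $ prefix):
# merge-based alternative to map
# assumes: df1['amount'] = df1['amount'].str.lstrip('$').astype(float)
revenue = (df1.groupby('transaction_id', as_index=False)['amount'].sum()
              .rename(columns={'amount': 'revenue'}))
df2 = df2.merge(revenue, on='transaction_id', how='left')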

Related

How can I get rows that compose up to 90% of a sum?

I have two different dataframes, one containing the net revenue by SKU and supplier and another one containing the stock of SKUs in each store. I need to get, for each supplier, the average number of stores carrying the SKUs that compose up to 90% of the supplier's net revenue. It's a bit complicated, but I will give an example that I hope makes it clear. Please note that if 3 SKUs compose 89% of the revenue, we need to consider another one.
Example:
Dataframe 1 - Net Revenue
Supplier    SKU     Net Revenue
UNILEVER    1111    10000
UNILEVER    2222    50000
UNILEVER    3333    500
PEPSICO     1313    680
PEPSICO     2424    10000
PEPSICO     2323    450
Dataframe 2 - Stock
Store    SKU     Stock
1        1111    1
1        2222    2
1        3333    1
2        1111    1
2        2222    0
2        3333    1
In this case, for UNILEVER, we need to discard SKU 3333 because its net revenue is not relevant (1111 and 2222 already compose more than 90% of UNILEVER's total net revenue). Coverage in this case will be 1.5 (1111 is stocked in 2 stores and 2222 in one store: (2+1)/2).
Result is something like this:
Supplier    Coverage
UNILEVER    1.5
PEPSICO     ...
Please note that the real dataset has a different number of SKUs per supplier and a large number of suppliers (around 150), so performance doesn't need to be the top priority, but it has to be considered.
Thanks in advance, guys.
Calculate the cumulative sum grouping by Supplier and divide by the supplier's total revenue.
Then find each supplier's revenue threshold by taking the minimum cumulative revenue percentage that reaches 90%.
Then you can get the list of SKUs by Supplier and calculate the coverage.
import pandas as pd

df = pd.DataFrame([
    ['UNILEVER', '1111', 10000],
    ['UNILEVER', '2222', 50000],
    ['UNILEVER', '3333', 500],
    ['PEPSICO', '1313', 680],
    ['PEPSICO', '2424', 10000],
    ['PEPSICO', '2323', 450],
], columns=['Supplier', 'SKU', 'Net Revenue'])

# total revenue per supplier
total_revenue_by_supplier = df.groupby('Supplier')['Net Revenue'].sum().reset_index()
total_revenue_by_supplier.columns = ['Supplier', 'Total Revenue']

# cumulative revenue per supplier, highest-revenue SKUs first
df = df.sort_values(['Supplier', 'Net Revenue'], ascending=[True, False])
df['cumsum'] = df.groupby('Supplier')['Net Revenue'].transform('cumsum')
df = df.merge(total_revenue_by_supplier, on='Supplier')
df['cumpercentage'] = df['cumsum'] / df['Total Revenue']

# threshold = smallest cumulative percentage that reaches 90%
min_before_threshold = (df[df['cumpercentage'] >= 0.9][['Supplier', 'cumpercentage']]
                        .groupby('Supplier').min().reset_index())
min_before_threshold.columns = ['Supplier', 'Revenue Threshold']

# keep only the SKUs needed to reach the threshold
df = df.merge(min_before_threshold, on='Supplier')
df = df[df['cumpercentage'] <= df['Revenue Threshold']][['Supplier', 'SKU', 'Net Revenue']]
df
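The final coverage step is not shown above; here is a minimal sketch of one way to finish it, assuming the stock table from the question is loaded as df_stock (the name is illustrative) and that a SKU covers a store whenever its stock there is positive:
df_stock = pd.DataFrame([
    [1, '1111', 1], [1, '2222', 2], [1, '3333', 1],
    [2, '1111', 1], [2, '2222', 0], [2, '3333', 1],
], columns=['Store', 'SKU', 'Stock'])

relevant = df.merge(df_stock, on='SKU')          # only the SKUs kept above
in_stock = relevant[relevant['Stock'] > 0]       # stock > 0 means the store carries the SKU
stores_per_sku = in_stock.groupby(['Supplier', 'SKU'])['Store'].nunique()
coverage = stores_per_sku.groupby('Supplier').mean().rename('Coverage').reset_index()
print(coverage)  # UNILEVER -> (2 + 1) / 2 = 1.5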

dataframe pivot based on 3 columns

I have a data frame like shown below
customer    organization    currency         volume    revenue    Duration
Peter       XYZ Ltd         CNY, INR         20        3,000      01-Oct-2022
John        abc Ltd         INR              7         184        01-Oct-2022
Mary        aaa Ltd         USD              3         43         03-Oct-2022
John        bbb Ltd         THB              17        2,300      04-Oct-2022
Dany        ccc Ltd         CNY, INR, KRW    45        15,100     04-Oct-2022
If I pivot as shown below
df = pd.pivot_table(df, values=['runs', 'volume', 'revenue'],
                    index=['customer', 'organization', 'currency'],
                    columns=['Duration'],
                    aggfunc=sum,
                    fill_value=0)
With this, level 0 of the columns becomes volume / revenue / runs, each repeated for every Duration at level 1.
I would like to pivot with Duration as level 0 and volume and revenue as level 1 instead.
How can I achieve this?
Current output: (screenshot in the original post)
I would like to have the date at level 0, with volume, revenue and runs under it.
You can use swaplevel on your current pivot code, like this:
df1 = df.pivot_table(index=['customer', 'organization', 'currency'],
                     columns=['Duration'],
                     aggfunc=sum,
                     fill_value=0).swaplevel(0, 1, axis=1).sort_index(axis=1)
Hope this helps.
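As a quick check (a sketch, assuming the pivot above was built from the sample data), Duration now sits at level 0 of the columns, so one date's volume/revenue block can be pulled with a cross-section:
# select the columns belonging to a single Duration value
print(df1.xs('01-Oct-2022', axis=1, level=0))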

Comparing two dataframes without duplicates

I have two similarly structured dataframes that represent two periods in time, say Jul 2020 and Aug 2020. The data in them is forecasted and/or realised revenue data from several company sources like CRM and accounting applications. The columns contain data on clients, product, quantity, price, revenue, period, etc. Now I want to see what happened between these two months by comparing the two dataframes.
I tried to do this by renaming some of the columns like quantity, price and revenue and then merging the two dataframes on client, product and period. After that I calculate the difference on the quantity, price and revenue.
However I run into a problem... Suppose one specific customer has closed a contract with us to purchase two specific products (abc & xyz) every month for the next two years. That means that in our July forecast we can include these two items as revenue. In reality this list is much longer with other contracts and also expected revenue that is in the weighted pipeline.
This is a small extract from the total forecast for our specific client.
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 contracted 1 100 100
1 A xyz 2020-07 contracted 1 50 50
Now suppose this client decides to purchase product xyz a second time and we get another contract for it. Then it looks like this for July:
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 contracted 1 100 100
1 A xyz 2020-07 contracted 1 50 50
2 A xyz 2020-07 contracted 1 50 50
Now suppose we are a month later and from our accounting system we pull the realised revenue, which looks like this (so what we forecasted became reality):
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 realised 1 100 100
1 A xyz 2020-07 realised 2 50 100
And now I want to compare them by merging the two df's after renaming some of the columns.
def rename_column(df_name, col_name, first_forecast_period):
    df_name.rename(columns={col_name: col_name + '_' + first_forecast_period}, inplace=True)
    return df_name

rename_column(df_1, 'Stage', '1')
rename_column(df_1, 'Price', '1')
rename_column(df_1, 'Qty', '1')
rename_column(df_1, 'Rev', '1')
rename_column(df_2, 'Stage', '2')
rename_column(df_2, 'Price', '2')
rename_column(df_2, 'Qty', '2')
rename_column(df_2, 'Rev', '2')

result_1 = pd.merge(df_1, df_2, how='outer')
And then some math to get the differences:
result_1['Qty_diff'] = result_1['Qty_2'] - result_1['Qty_1']
result_1['Price_diff'] = result_1['Price_2'] - result_1['Price_1']
result_1['Rev_diff'] = result_1['Rev_2'] - result_1['Rev_1']
This results in:
Client Product Period Stage_1 Qty_1 Price_1 Rev_1 Stage_2 Qty_2 Price_2 Rev_2 Qty_diff Price_diff Rev_diff
0 A abc 2020-07 contracted 1 100 100 realised 1 100 100 0 0 0
1 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
2 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
So, the problem is that in the third line the realised part is included a second time. Since the forecast and the reality are the same, the outcome should have been:
Client Product Period Stage_1 Qty_1 Price_1 Rev_1 Stage_2 Qty_2 Price_2 Rev_2 Qty_diff Price_diff Rev_diff
0 A abc 2020-07 contracted 1 100 100 realised 1 100 100 0 0 0
1 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
2 A xyz 2020-07 contracted 1 50 50 realised 0 0 0 -1 0 -50
And therefore I get a total revenue difference of 100 (+50 and +50) instead of 0 (+50 and -50). Is there any way this can be solved by merging the two DFs, or do I need to start thinking in another direction? If so, any suggestions would be helpful! Thanks.
You should probably get the totals for client-product-period on both dfs to be safe. Assuming all rows in df_1 are 'contracted', you can do:
df_1 = (df_1.groupby(['Client', 'Product', 'Period'])
            .agg({'Stage': 'first', 'Qty': 'sum', 'Price': 'first', 'Rev': 'sum'})
            # if price can vary between rows of the same product-client:
            # .agg({'Stage': 'first', 'Qty': 'sum', 'Price': 'mean', 'Rev': 'sum'})
            .reset_index())
# do the same for df_2
Now you can merge both dfs with:
df_merged = df_1.merge(df_2, on=['Client', 'Product', 'Period'])
The result will add suffixes to duplicate columns, _x and _y for df_1 and df_2 respectively.
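From there, the differences in the question's naming can be computed; a minimal sketch, assuming both aggregated frames had reset_index() applied and using an outer join so lines dropped on either side still appear:
df_merged = df_1.merge(df_2, on=['Client', 'Product', 'Period'],
                       how='outer', suffixes=('_1', '_2'))
for col in ['Qty', 'Price', 'Rev']:
    # rows missing on one side become NaN; treat them as 0 before differencing
    df_merged[col + '_diff'] = df_merged[col + '_2'].fillna(0) - df_merged[col + '_1'].fillna(0)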

Creating a dictionary of categoricals in SQL and aggregating them in Python

I have a rather "cross-platform" question. I hope it is not too general.
One of my tables, say customers, consists of my customer id's and their associated demographic information. Another table, say transaction, contains all purchases from the customers in the respective shops.
I am interested in analyzing basket compositions together with demographics in Python. Hence, I would like to have the shops as columns and, for each customer, the summed amount spent at each shop in my dataframe.
For clarity,
select *
from customer
where id=1 or id=2
gives me
id age gender
1 35 MALE
2 57 FEMALE
and
select *
from transaction
where id=1 or id=2
gives me
customer_id shop amount
1 2 250
1 2 500
2 3 100
2 7 200
2 11 125
Which should end up in a (preferably) Pandas dataframe as
id age gender shop_2 shop_3 shop_7 shop_11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
Such that the last columns are the aggregated baskets of the customers.
I have tried to create a python dictionary of the purchases and amounts for each customer in SQL in the following way:
select customer_id, array_agg(concat(cast(shop as varchar), ' : ', cast(amount as varchar))) as basket
from transaction
group by customer_id
Resulting in
id basket
1 ['2 : 250', '2 : 500']
2 ['3 : 100', '7 : 200', '11 : 125']
which could easily be joined on the customer table.
However, this solution is not optimal because the values inside the [] are strings, not integers. Hence it involves a lot of manipulation and looping in Python to get it into the format I want.
Is there any way where I can aggregate the purchases in SQL making it easier for python to read and aggregate into columns?
One simple solution would be to do the aggregation in pandas using pivot_table on the second dataframe and then merge with the first:
df2 = df2.pivot_table(columns='shop', values='amount', index='customer_id',
                      aggfunc='sum', fill_value=0.0).reset_index()
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')
Resulting dataframe:
id age gender 2 3 7 11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
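If column names like shop_2 are preferred over the bare shop numbers, the pivoted columns can be prefixed before merging (a small optional tweak, not part of the answer above):
# optional: prefix the shop columns so they read shop_2, shop_3, ...
df2 = df2.add_prefix('shop_').rename(columns={'shop_customer_id': 'customer_id'})
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')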

Dynamic Sum in Pandas

I have a pandas dataframe of grocery transactions containing ['customer_id', 'date', 'item_code', and 'amount'].
I want to group multiple transactions from the same day into 1 transaction, with a sum of those individual transactions. For example, if I bought 3 items on 1-1-16, for $5, $10, and $15 each, I want that to be represented as a single row with a value of $30.
That part is a simple groupby
df.groupby(['customer_id', 'date'])['amount'].sum()
My problem is that I want to create a new column called "transaction_type" that assigns a code ('grpd') to a row if that row was grouped, and the corresponding value of item_code if it was not grouped.
So if I purchased 3 items on 1-1-16, but purchased a single new item on 1-2-16, I want my customer_id to show 2 rows in the dataframe. One for 1-1-16 with the custom 'grpd' value in the new transaction_type column, and one for 1-2-16 with the original value from the item_code column reproduced into the transaction_type column. So my dataframe would look like this in the end for my transactions:
customer_id date transaction_type amount
4231 1-1-16 grpd $30
4231 1-2-16 candy $5
Create dummy data:
df = pd.DataFrame({'customer_id':['4231']*4,'date':['1-1-2016','1-1-2016','1-1-2016','1-2-2016'],'items':['gum','candy','soda','candy'],'amount':[9,11,10,5]})
Input:
amount customer_id date items
0 9 4231 1-1-2016 gum
1 11 4231 1-1-2016 candy
2 10 4231 1-1-2016 soda
3 5 4231 1-2-2016 candy
Use .agg, np.where, and size:
import numpy as np

df_out = (df.groupby(['customer_id', 'date'])
            .agg({'items': lambda x: np.where(x.size > 1, 'grpd', x.min()), 'amount': 'sum'})
            .reset_index()
            .rename(columns={'items': 'transaction_type'}))
Output:
customer_id date amount transaction_type
0 4231 1-1-2016 30 grpd
1 4231 1-2-2016 5 candy
Once the transaction_type column exists (as in df_out above), you can group by it too:
df_out.groupby(['date', 'customer_id', 'transaction_type'])['amount'].sum()
