Pandas - Combining duplicate lines into one - python

Below is an example of a DataFrame where, throughout, there will be multiple instances like the SALES TAX EXPENSE lines: whenever an ActName is duplicated, those rows need to be combined into one line whose total Trans_Amt is the Trans_Type C amount minus the Trans_Type D amount.
So, for example, in this DataFrame there should only be one line for SALES TAX EXPENSE, and its total should be -36239.65.
This happens multiple times throughout the DataFrame, with many different ActNames. I'm looking for insight into the most efficient way to do this transformation and apply it to every instance where it occurs.
Thank you!
ActName ActCode Trans_Type Trans_Amt
0 SALES 401 C 2082748.85
1 SALES TAX EXPENSE 407 C 100000.00
30 DISCOUNTS 405 D -654.59
31 SALES TAX EXPENSE 407 D 136239.65

Group the data by the key columns and assign the difference to Trans_Amt, then drop the duplicates.
df['Trans_Amt'] = df.groupby(['ActName','ActCode']).Trans_Amt.apply(lambda x: x.diff(periods=-1)).combine_first(df['Trans_Amt'])
df.drop_duplicates('ActName')
ActName ActCode Trans_Type Trans_Amt
0 SALES 401 C 2082748.85
1 SALES TAX EXPENSE 407 C -36239.65
30 DISCOUNTS 405 D -654.59
Edit, based on a follow-up question: if the difference should instead be taken with the previous row, try
df['Trans_Amt'] = df.groupby(['ActName','ActCode']).Trans_Amt.apply(lambda x: x.diff()).combine_first(df['Trans_Amt'])
df.drop_duplicates('ActName', keep='last')
ActName ActCode Trans_Type Trans_Amt
0 SALES 401 C 2082748.85
30 DISCOUNTS 405 D -654.59
31 SALES TAX EXPENSE 407 D 36239.65
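An alternative sketch that avoids apply, assuming every duplicated ActName/ActCode pair is exactly one C row and one D row (the DataFrame construction below only reproduces the sample data from the question):
import pandas as pd

# Illustrative reconstruction of the sample data from the question.
df = pd.DataFrame({
    'ActName': ['SALES', 'SALES TAX EXPENSE', 'DISCOUNTS', 'SALES TAX EXPENSE'],
    'ActCode': [401, 407, 405, 407],
    'Trans_Type': ['C', 'C', 'D', 'D'],
    'Trans_Amt': [2082748.85, 100000.00, -654.59, 136239.65],
})

# Treat D amounts as negative, but only inside groups that actually have
# duplicates, so single-occurrence rows keep their original amount.
signed = df['Trans_Amt'].where(df['Trans_Type'].eq('C'), -df['Trans_Amt'])
dup = df.duplicated(['ActName', 'ActCode'], keep=False)

out = (df.assign(Trans_Amt=signed.where(dup, df['Trans_Amt']))
         .groupby(['ActName', 'ActCode'], as_index=False, sort=False)
         .agg({'Trans_Type': 'first', 'Trans_Amt': 'sum'}))
print(out)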

Related

Pandas: Combining data items on multiple criteria

I have a database of all customer transactions within the company I work at.
ID Payment Amount Month Year
A  Inward     100     2  2005
A  Outward    200     2  2005
B  Inward     100     7  2017
I am struggling to combine the sum/count of the Amount of those transactions per customer ID per month/year.
The only thing I have succeeded at is combining the sum/count of the Amount per customer ID:
Combined = data.groupby("ID")["Amount"].sum().rename("Sum").reset_index()
Can you please let me know what the alternative solutions are?
Thank you in advance!
You can use a list of columns in groupby like:
>>> df.groupby(['ID', 'Year', 'Month', 'Payment'])['Amount'].agg(['sum', 'count'])
sum count
ID Year Month Payment
A 2005 2 Inward 100 1
Outward 200 1
B 2017 7 Inward 100 1
Going further, if Outward amounts should count as negative:
>>> import numpy as np
>>> df.assign(Amount=np.where(df['Payment'].eq('Outward'),
                              -df['Amount'], df['Amount'])) \
      .groupby(['ID', 'Year', 'Month'])['Amount'].agg(['sum', 'count'])
sum count
ID Year Month
A 2005 2 -100 2
B 2017 7 100 1
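If you prefer the result as a flat DataFrame rather than a MultiIndexed one, here is a minimal, self-contained sketch of the same idea (the DataFrame construction just reproduces the sample data from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'A', 'B'],
    'Payment': ['Inward', 'Outward', 'Inward'],
    'Amount': [100, 200, 100],
    'Month': [2, 2, 7],
    'Year': [2005, 2005, 2017],
})

# Count Outward payments as negative, then aggregate and flatten the index.
signed = np.where(df['Payment'].eq('Outward'), -df['Amount'], df['Amount'])
out = (df.assign(Amount=signed)
         .groupby(['ID', 'Year', 'Month'])['Amount']
         .agg(['sum', 'count'])
         .reset_index())
print(out)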

Comparing two dataframes without duplicates

I have two similarly structured dataframes that represent two periods in time, say Jul 2020 and Aug 2020. They contain forecasted and/or realised revenue data from several company sources such as the CRM and the accounting application. The columns contain data on clients, product, quantity, price, revenue, period, etc. Now I want to see what happened between these two months by comparing the two dataframes.
I tried to do this by renaming some of the columns, like quantity, price and revenue, and then merging the two dataframes on client, product and period. After that I calculate the difference in quantity, price and revenue.
However, I run into a problem. Suppose one specific customer has closed a contract with us to purchase two specific products (abc & xyz) every month for the next two years. That means that in our July forecast we can include these two items as revenue. In reality this list is much longer, with other contracts and also expected revenue that sits in the weighted pipeline.
This is a small extract from the total forecast for our specific client.
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 contracted 1 100 100
1 A xyz 2020-07 contracted 1 50 50
Now suppose this client decides to purchase product xyz a second time and we get another contract for it. Then it looks like this for July:
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 contracted 1 100 100
1 A xyz 2020-07 contracted 1 50 50
2 A xyz 2020-07 contracted 1 50 50
Now suppose we are a month later and we pull the realised revenue from our accounting system, which looks like this (so what we forecasted became reality):
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 realised 1 100 100
1 A xyz 2020-07 realised 2 50 100
And now I want to compare them by merging the two df's after renaming some of the columns.
def rename_column(df_name, col_name, first_forecast_period):
    df_name.rename(columns={col_name: col_name + '_' + first_forecast_period}, inplace=True)
    return df_name
rename_column(df_1, 'Stage', '1')
rename_column(df_1, 'Price', '1')
rename_column(df_1, 'Qty', '1')
rename_column(df_1, 'Rev', '1')
rename_column(df_2, 'Stage', '2')
rename_column(df_2, 'Price', '2')
rename_column(df_2, 'Qty', '2')
rename_column(df_2, 'Rev', '2')
result_1 = pd.merge(df_1, df_2, how ='outer')
And then some math to get the differences:
result_1['Qty_diff'] = result_1['Qty_2'] - result_1['Qty_1']
result_1['Price_diff'] = result_1['Price_2'] - result_1['Price_1']
result_1['Rev_diff'] = result_1['Rev_2'] - result_1['Rev_1']
This results in:
Client Product Period Stage_1 Qty_1 Price_1 Rev_1 Stage_2 Qty_2 Price_2 Rev_2 Qty_diff Price_diff Rev_diff
0 A abc 2020-07 contracted 1 100 100 realised 1 100 100 0 0 0
1 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
2 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
So, the problem is that in the third line the realised part is included a second time. Since the forecast and the reality are the same, the outcome should have been:
Client Product Period Stage_1 Qty_1 Price_1 Rev_1 Stage_2 Qty_2 Price_2 Rev_2 Qty_diff Price_diff Rev_diff
0 A abc 2020-07 contracted 1 100 100 realised 1 100 100 0 0 0
1 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
2 A xyz 2020-07 contracted 1 50 50 realised 0 0 0 -1 0 -50
And therefore I get a total revenue difference of 100 (+50 and +50) instead of 0 (+50 and -50). Is there any way this can be solved by merging the two DFs, or do I need to start thinking in another direction? If so, any suggestions would be helpful! Thanks.
You should probably get the totals for client-product-period on both dfs to be safe. Assuming all rows in df_1 are 'contracted', you can do:
df_1 = (df_1.groupby(['Client', 'Product', 'Period'], as_index=False)
            .agg({'Stage': 'first', 'Qty': 'sum', 'Price': 'first', 'Rev': 'sum'}))
# if price can vary between rows of the same product-client:
# .agg({'Stage': 'first', 'Qty': 'sum', 'Price': 'mean', 'Rev': 'sum'})
# same for df_2
Now you can merge both dfs on the key columns:
df_merged = df_1.merge(df_2, on=['Client', 'Product', 'Period'])
The merge will add suffixes to the overlapping non-key columns: _x for df_1 and _y for df_2.
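From there, a sketch of the follow-up arithmetic (column names assumed from the question; an outer merge plus fillna(0) makes a side that is missing count as zero, so dropped forecast items show up as negative differences):
# Re-run the merge as an outer join so rows present on only one side are kept.
df_merged = df_1.merge(df_2, on=['Client', 'Product', 'Period'], how='outer')

for col in ['Qty', 'Price', 'Rev']:
    # _x comes from df_1 (forecast), _y from df_2 (realised)
    df_merged[f'{col}_x'] = df_merged[f'{col}_x'].fillna(0)
    df_merged[f'{col}_y'] = df_merged[f'{col}_y'].fillna(0)
    df_merged[f'{col}_diff'] = df_merged[f'{col}_y'] - df_merged[f'{col}_x']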

Using Pandas to map results of a groupby.sum() to another dataframe?

I have two dataframes - one at the micro level, containing all line items purchased across all transactions (DF1). The other dataframe is to be built as a higher-level aggregation that summarizes the revenue generated per transaction, essentially summing up all line items for each transaction (DF2).
df1
Out[df1]:
transaction_id item_id amount
0 AJGDO-12304 120 $120
1 AJGDO-12304 40 $10
2 AJGDO-12304 01 $10
3 ODSKF-99130 120 $120
4 ODSKF-99130 44 $30
5 ODSKF-99130 03 $50
df2
Out[df2]
transaction_id location_id customer_id revenue(THIS WILL BE THE ADDED COLUMN!)
0 AJGDO-12304 2131234 1234 $140
1 ODSKF-99130 213124 1345 $200
How would I go about linking the output of a groupby.sum() to df2? The revenue column should essentially be the aggregation of df1's amount per transaction_id, linked to df2['transaction_id'].
Here is what I have tried so far, but I am struggling to put it together:
results = df1.groupby('transaction_id')['amount'].sum()
df2['revenue'] = df2['transaction_id'].merge(results,how='left').value
Use map:
lookup = df1.groupby(['transaction_id'])['amount'].sum()
df2['revenue'] = df2.transaction_id.map(lookup)
print(df2)
Output
transaction_id location_id customer_id revenue
0 AJGDO-12304 2131234 1234 140
1 ODSKF-99130 213124 1345 200
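One caveat: in df1 the amount values are strings like $120, so the sum only works on numeric data. A small sketch for converting them first, assuming a plain $ prefix:
# Strip the currency symbol before aggregating (assumes a plain "$" prefix).
df1['amount'] = df1['amount'].str.replace('$', '', regex=False).astype(float)

lookup = df1.groupby('transaction_id')['amount'].sum()
df2['revenue'] = df2.transaction_id.map(lookup)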

How to run a groupby based on result of other/previous groupby?

Let's assume you are selling a product globally and you want to set up a sales office somewhere in a major city. Your decision will be based purely on sales numbers.
This will be your (simplified) sales data:
import pandas as pd

df = {
    'Product': 'Chair',
    'Country': ['USA', 'USA', 'China', 'China', 'China', 'China', 'India',
                'India', 'India', 'India', 'India', 'India', 'India'],
    'Region': ['USA_West', 'USA_East', 'China_West', 'China_East', 'China_South', 'China_South',
               'India_North', 'India_North', 'India_North', 'India_West', 'India_West',
               'India_East', 'India_South'],
    'City': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'],
    'Sales': [1000, 1000, 1200, 200, 200, 200, 500, 350, 350, 100, 700, 50, 50]
}
dff = pd.DataFrame.from_dict(df)
dff
Based on the data you should go for City "G".
The logic should go like this:
1) Find country with Max(sales)
2) in that country, find region with Max(sales)
3) in that region, find city with Max(sales)
I tried: groupby('Product', 'City').apply(lambda x: x.nlargest(1)), but this doesn't work, because it would propose city "C". This is the city with highest sales globally, but China is not the Country with highest sales.
I probably have to go through several loops of groupby. Based on the result, filter the original dataframe and do a groupby again on the next level.
To add to the complexity, you sell other products too (not just 'Chairs', but also other furniture). You would have to store the results of each iteration (like country with Max(sales) per product) somewhere and then use it in the next iteration of the groupby.
Do you have any ideas, how I could implement this in pandas/python?
The idea is to aggregate the sum at each level and take the top value with Series.idxmax, which is then used to filter the next level by boolean indexing:
max_country = dff.groupby('Country')['Sales'].sum().idxmax()
max_region = dff[dff['Country'] == max_country].groupby('Region')['Sales'].sum().idxmax()
max_city = dff[dff['Region'] == max_region].groupby('City')['Sales'].sum().idxmax()
print (max_city)
G
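If you also sell other products, as mentioned in the question, one way (a sketch, not a tested implementation for your full data) is to wrap the three steps in a function and apply it per product:
def best_city(g):
    # Drill down country -> region -> city, keeping the top seller at each level.
    country = g.groupby('Country')['Sales'].sum().idxmax()
    g = g[g['Country'] == country]
    region = g.groupby('Region')['Sales'].sum().idxmax()
    g = g[g['Region'] == region]
    return g.groupby('City')['Sales'].sum().idxmax()

print(dff.groupby('Product').apply(best_city))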
One way is to add groupwise totals, then sort your dataframe. This goes beyond your requirement by ordering all your data using your preference logic:
df = pd.DataFrame.from_dict(df)
factors = ['Country', 'Region', 'City']
for factor in factors:
    df[f'{factor}_Total'] = df.groupby(factor)['Sales'].transform('sum')
res = df.sort_values([f'{x}_Total' for x in factors], ascending=False)
print(res.head(5))
City Country Product Region Sales Country_Total Region_Total \
6 G India Chair India_North 500 2100 1200
7 H India Chair India_North 350 2100 1200
8 I India Chair India_North 350 2100 1200
10 K India Chair India_West 700 2100 800
9 J India Chair India_West 100 2100 800
City_Total
6 500
7 350
8 350
10 700
9 100
So for the most desirable you can use res.iloc[0], for the second res.iloc[1], etc.
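To handle several products with this approach, a rough sketch continuing from the snippet above: include Product in each grouping and keep the first row per product after sorting.
# Totals computed within each product, then the best city per product.
for factor in factors:
    df[f'{factor}_Total'] = df.groupby(['Product', factor])['Sales'].transform('sum')

res = df.sort_values(['Product'] + [f'{x}_Total' for x in factors],
                     ascending=[True] + [False] * len(factors))
print(res.drop_duplicates('Product')[['Product', 'City']])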

Pandas dataframe: count number of string value is in row for specific ID

I have the following use case:
I want to make a dataframe where each row has a column showing how many interactions there have been for this ID (user) across the categories. The hardest part for me is that they can't be double counted, while a match in just one of the categories is enough to count as 1.
So for example I have:
richtingen id
0 Marketing, Sales 1110
1 Marketing, Sales 1110
2 Finance 220
3 Marketing, Engineering 1110
4 IT 3300
Now I want to create a third column where I can see how many times this ID has interacted with any of these categories in total. Each comma-separated value is a category of its own, so for example "Marketing, Sales" means the two categories Marketing and Sales. To get a +1 you only need a match with another row where the ID is the same and one of the categories matches; for example, for index 0 the result would be 3 (indexes 0, 1 and 3 match). The output for the example should be:
richtingen id freq
0 Marketing, Sales 1110 3
1 Marketing, Sales 1110 3
2 Finance 220 1
3 Marketing, Engineering 1110 3
4 IT 3300 1
The hard part for me seems to be that I can't simply split all the categories into new rows, because then you might start counting double. For example, index 0 matches both Marketing and Sales of index 1, and I want that to add just 1, not 2.
The code I have so far is:
df['freq'] = df.groupby(['id', 'richtingen'])['id'].transform('count')
This only matches identical combinations of categories, though.
Other things I've tried:
- creating a new column with all the categories split into an array:
df['splitted'] = df.richtingen.apply(lambda x: str(x.split(",")))
and then the plan was to use something along the lines of this code, in combination with a groupby on id, to count the number of times it is true per item:
if any(t < 0 for t in x):
    # do something
I couldn't get this to work either.
I tried splitting categories in new rows, or columns but then got an issue of double counting.
For example using code suggested:
df['richtingen'].str.split(', ',expand=True)
Gives me the following:
0 1 id
0 Marketing Sales 1110
1 Marketing Sales 1110
2 dDD None 220
3 Marketing Engineering 1110
4 ddsad None 3300
But then I would need to create code that goes over every row, checks the ID, lists the values in the columns and checks whether they are contained in any of the other columns (where the ID is the same), and, if one of them matches, adds 1 to freq. I suspect this might be possible with groupby, but I am not sure and can't figure it out.
(Solution suggested by Jezrael below):
If you need to count unique categories per id, first split, create a MultiIndex Series with stack, and finally use SeriesGroupBy.nunique with map for a new column in the original DataFrame.
I think this solution is close, but at the moment it counts the total number of unique categories (not the number of interactions with the categories). For example, the output at index 2 here is 2, while it should be 1 (as the user only interacted with the categories once).
richtingen id freq
0 Marketing, Sales 1110 3
1 Marketing, Sales 1110 3
2 Finance, Accounting 220 2
3 Marketing, Engineering 1110 3
4 IT 3300 1
I hope I made myself clear and that someone knows how to fix this! In total there will be around 13 categories, always in one cell, separated by commas.
For msr_003:
id richtingen freq_x freq_y
0 220 Finance, IT 0 2
1 1110 Finance, IT 1 2
2 1110 Marketing, Sales 2 4
3 1110 Marketing, Sales 3 4
4 220 Marketing 4 1
5 220 Finance 5 2
6 1110 Marketing, Sales 6 4
7 3300 IT 7 1
8 1110 Marketing, IT 8 4
If you need to count unique categories per id, first split, create a MultiIndex Series with stack, and finally use SeriesGroupBy.nunique with map to create the new column in the original DataFrame:
s = (df.set_index('id')['richtingen']
       .str.split(', ', expand=True)
       .stack()
       .groupby(level=0)
       .nunique())
print (s)
id
220 1
1110 3
3300 1
dtype: int64
df['freq'] = df['id'].map(s)
print (df)
richtingen id freq
0 Marketing, Sales 1110 3
1 Marketing, Sales 1110 3
2 Finance 220 1
3 Marketing, Engineering 1110 3
4 IT 3300 1
Detail:
print (df.set_index('id')['richtingen'].str.split(', ',expand=True).stack())
id
1110 0 Marketing
1 Sales
0 Marketing
1 Sales
220 0 Finance
1110 0 Marketing
1 Engineering
3300 0 IT
dtype: object
I just modified your code as below.
count_unique = pd.DataFrame({'richtingen' : ["Finance, IT","Finance, IT", "Marketing, Sales", "Marketing, Sales", "Marketing","Finance", "Marketing, Sales", "IT", "Marketing, IT"], 'id': [220,1110,1110, 1110,220, 220,1110,3300,1110]})
count_unique['freq'] = list(range(0,len(count_unique)))
grp = count_unique.groupby(['richtingen', 'id']).agg({'freq' : 'count' }).reset_index(level = [0,1])
pd.merge(count_unique,grp, on = ('richtingen','id'), how = 'left')
I am not that into pandas, but I think you may have some luck adding 13 new columns based on richtingen, each column containing one category or none. You can use DataFrame.apply or a similar function to compute the values when creating the columns.
Then you can take it from there by ORing stuff...
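For reference, those indicator columns can be built with str.get_dummies, and grouping by id with max then acts as the OR; a minimal sketch on the question's sample data:
import pandas as pd

df = pd.DataFrame({
    'richtingen': ['Marketing, Sales', 'Marketing, Sales', 'Finance',
                   'Marketing, Engineering', 'IT'],
    'id': [1110, 1110, 220, 1110, 3300],
})

# One 0/1 indicator column per category, split on the comma separator.
dummies = df['richtingen'].str.get_dummies(sep=', ')

# "OR" per id: 1 if the id interacted with the category in any row.
per_id = dummies.groupby(df['id']).max()
print(per_id)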
