I have a database of all customer transactions within the company I work at.
ID  Payment  Amount  Month  Year
A   Inward      100      2  2005
A   Outward     200      2  2005
B   Inward      100      7  2017
I am having trouble computing the sum/count of Amount per customer ID per month/year.
The only thing I have succeeded at is the sum/count of Amount per customer ID:
Combined = data.groupby("ID")["Amount"].sum().rename("Sum").reset_index()
Can you please let me know what the alternative solutions are?
Thank you in advance!
You can use a list of columns in groupby like:
>>> df.groupby(['ID', 'Year', 'Month', 'Payment'])['Amount'].agg(['sum', 'count'])
                       sum  count
ID Year Month Payment
A  2005 2     Inward   100      1
              Outward  200      1
B  2017 7     Inward   100      1
To go further and net Outward payments against Inward ones (this assumes import numpy as np):
>>> df.assign(Amount=np.where(df['Payment'].eq('Outward'),
-df['Amount'], df['Amount'])) \
.groupby(['ID', 'Year', 'Month'])['Amount'].agg(['sum', 'count'])
               sum  count
ID Year Month
A  2005 2     -100      2
B  2017 7      100      1
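If you prefer flat, ready-named columns, named aggregation (pandas >= 0.25) is another option; a small sketch with the same df:
>>> df.groupby(['ID', 'Year', 'Month'])['Amount'].agg(Sum='sum', Count='count').reset_index()
  ID  Year  Month  Sum  Count
0  A  2005      2  300      2
1  B  2017      7  100      1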
I have this dataset:
import pandas as pd
d = {'Agreements': ["Rome", "NewYork", "Paris", "Tokyo"], 'Year': [2012, 2012, 2013, 2013],
'Provision1': [1, 1, 1, 1], 'Provision2': [1, 1, 0, 1], 'Provision3': [0, 1, 1, 0]}
df = pd.DataFrame(data=d)
  Agreements  Year  Provision1  Provision2  Provision3
0       Rome  2012           1           1           0
1    NewYork  2012           1           1           1
2      Paris  2013           1           0           1
3      Tokyo  2013           1           1           0
I would like to group by Year to obtain this output:
Year  Count Agreements per Year  Count Provisions per Year  Average Provision per Year
2012                          2                          5                         2.5
2013                          2                          4                           2
I have tried df.groupby('Year')['Agreements'].count().reset_index(name='counts'), but I do not know how to expand it to obtain the output I desire. Thanks!
Try:
out = df.assign(
Provisions=df.loc[:, "Provision1":"Provision3"].sum(axis=1)
).pivot_table(
index="Year", aggfunc={"Agreements": "count", "Provisions": ("sum", "mean")}
)
out.columns = [
f'{b.capitalize().replace("Mean", "Average")} {a} per Year' for a, b in out.columns
]
print(out.reset_index().to_markdown(index=False))
Prints:
|   Year |   Count Agreements per Year |   Average Provisions per Year |   Sum Provisions per Year |
|-------:|----------------------------:|------------------------------:|--------------------------:|
|   2012 |                           2 |                           2.5 |                         5 |
|   2013 |                           2 |                           2   |                         4 |
EDIT: To add a column with the count of agreements with at least one provision:
out = df.assign(
Provisions=df.loc[:, "Provision1":"Provision3"].sum(axis=1),
AggreementsWithAtLeastOneProvision=df.loc[:, "Provision1":"Provision3"].any(axis=1)
).pivot_table(
index="Year", aggfunc={"Agreements": "count", "AggreementsWithAtLeastOneProvision": "sum", "Provisions": ("sum", "mean")}
)
out.columns = [
f'{b.capitalize().replace("Mean", "Average")} {a} per Year' for a, b in out.columns
]
print(out.reset_index().to_markdown(index=False))
Prints:
|   Year |   Sum AggreementsWithAtLeastOneProvision per Year |   Count Agreements per Year |   Average Provisions per Year |   Sum Provisions per Year |
|-------:|--------------------------------------------------:|----------------------------:|------------------------------:|--------------------------:|
|   2012 |                                                 2 |                           3 |                       1.66667 |                         5 |
|   2013 |                                                 2 |                           2 |                       2       |                         4 |
Input data in this case was:
Agreements Year Provision1 Provision2 Provision3
0 Bratislava 2012 0 0 0
1 Rome 2012 1 1 0
2 NewYork 2012 1 1 1
3 Paris 2013 1 0 1
4 Tokyo 2013 1 1 0
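Note that DataFrame.to_markdown used above relies on the optional tabulate package; if it is not installed, a plain print(out.reset_index()) shows the same numbers.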
You can use melt and agg:
out = (df.assign(Average=lambda x: x.filter(like='Provision').sum(axis=1))
.melt(id_vars=['Year', 'Agreements', 'Average'], var_name='Provision').groupby('Year')
.agg(**{'Count Agreements per Year': ('Agreements', 'nunique'),
'Count Provisions per Year': ('value', 'sum'),
'Average Provision per Year': ('Average', 'mean')})
.reset_index())
Output:
>>> out
Year Count Agreements per Year Count Provisions per Year Average Provision per Year
0 2012 2 5 2.5
1 2013 2 4 2.0
There are multiple ways to approach this; here is a simple one.
First, combine all "Provision" columns into one column:
# combining all Provision columns into one column
df['All_Provisions'] = df['Provision1'] + df['Provision2'] + df['Provision3']
Second, aggregate the columns using .agg, which takes your columns and desired aggregations as a dictionary. In our case, we want to:
- count the "Agreements" column
- sum the "All_Provisions" column
- average the "All_Provisions" column
We can do it like this:
# aggregating columns
df = df.groupby('Year', as_index=False).agg({'Agreements':'count', 'All_Provisions':['sum', 'mean']})
Finally, rename your columns:
# renaming columns
df.columns = ['Year','Count Agreements per Year','Count Provisions per Year','Average Provision per Year']
Hope this helps :)
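As a variant, named aggregation (pandas >= 0.25) produces the final column names in one step, so the manual renaming becomes unnecessary; a sketch starting from the frame that already has the All_Provisions column:
df = df.groupby('Year', as_index=False).agg(
    **{'Count Agreements per Year': ('Agreements', 'count'),
       'Count Provisions per Year': ('All_Provisions', 'sum'),
       'Average Provision per Year': ('All_Provisions', 'mean')})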
Another possible solution:
(df.iloc[:, 1:].set_index('Year').stack().to_frame()           # long format: one row per provision flag
   .pivot_table(index='Year', values=0,
                aggfunc=[lambda x: x.count() / 3,               # 3 flags per agreement -> agreement count
                         'sum',                                 # total provisions
                         lambda x: x.sum() / (x.count() / 3)])  # provisions per agreement
   .set_axis(['Count Agreements per Year', 'Count Provisions per Year',
              'Average Provision per Year'], axis=1).reset_index())
Using a numpy approach:
import numpy as np

a = df.iloc[:, 1:].values
colnames = ['Year','Count Agreements per Year',
'Count Provisions per Year', 'Average Provision per Year' ]
pd.DataFrame(
np.vstack(
[[x[0,0], np.size(x[:,1:])/3, np.sum(x[:,1:]),
np.sum(x[:,1:])/(np.size(x[:,1:])/3)]
for x in [a[np.where(a[:, 0] == val)] for val in np.unique(a[:, 0])]]),
columns=colnames).convert_dtypes()
Output:
Year Count Agreements per Year Count Provisions per Year \
0 2012 2 5
1 2013 2 4
Average Provision per Year
0 2.5
1 2.0
My dataframe df is:
Election Year Votes Party Region
0 2000 50 A a
1 2000 100 B a
2 2000 26 A b
3 2000 180 B b
4 2000 300 A c
5 2000 46 C c
6 2005 149 A a
7 2005 46 B a
8 2005 312 A b
9 2005 23 B b
10 2005 16 A c
11 2005 35 C c
I want to get the Party winning the most regions every year. So the desired output is:
Election Year Party
2000 B
2005 A
I tried this code to get the above output, but it gives an error:
winner = df.groupby(['Election Year'])['Votes'].max().reset_index()
winner = winner.groupby('Election Year').first().reset_index()
winner = winner[['Election Year', 'Party']].to_string(index=False)
winner
How can I get the desired output?
Here is one approach with nested groupby. We first count per-party votes in each year-region pair, then use mode to find the party winning the most regions. The mode need not be unique (if two or more parties win the same number of regions).
df.groupby(["Election Year", "Region"])\
    .apply(lambda gp: gp.groupby("Party").Votes.sum().idxmax())\
    .unstack().mode(1).rename(columns={0: "Party"})

              Party
Election Year
2000              B
2005              A
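If you want to see the region counts behind the mode (useful when checking for ties), one possible sketch reusing the intermediate table (the winners name is just illustrative):
winners = df.groupby(["Election Year", "Region"])\
    .apply(lambda gp: gp.groupby("Party").Votes.sum().idxmax())\
    .unstack()
# number of regions won by each party, per election year
print(winners.apply(lambda r: r.value_counts(), axis=1).fillna(0).astype(int))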
To address the comment, you can replace idxmax above with nlargest and diff to find regions where the winning margin is below a given number.
margin = df.groupby(["Election Year", "Region"])\
    .apply(lambda gp: gp.groupby("Party").Votes.sum().nlargest(2).diff()) > -125
print(margin[margin].reset_index()[["Election Year", "Region"]])
#    Election Year Region
# 0           2000      a
# 1           2005      a
# 2           2005      c
You can use GroupBy.idxmax() to get the index of max Votes for each group of Election Year, then use .loc to locate the rows, followed by selection of the required columns, as follows:
df.loc[df.groupby('Election Year')['Votes'].idxmax()][['Election Year', 'Party']]
Result:
Election Year Party
4 2000 A
8 2005 A
Edit
If we want to get the Party winning the most regions, we can use the following code (without using the slow .apply() with a lambda function):
(df.loc[
df.groupby(['Election Year', 'Region'])['Votes'].idxmax()]
[['Election Year', 'Party', 'Region']]
.pivot(index='Election Year', columns='Region')
.mode(axis=1)
).rename({0: 'Party'}, axis=1).reset_index()
Result:
Election Year Party
0 2000 B
1 2005 A
Try this
winner = df.groupby(['Election Year','Party'])['Votes'].max().reset_index()
winner.drop('Votes', axis = 1, inplace = True)
winner
Another method (close to @hilberts_drinking_problem's, in fact):
>>> df.groupby(["Election Year", "Region"]) \
.apply(lambda x: x.loc[x["Votes"].idxmax(), "Party"]) \
.unstack().mode(axis="columns") \
.rename(columns={0: "Party"}).reset_index()
Election Year Party
0 2000 B
1 2005 A
I believe the one-liner df.groupby(["Election Year"]).max().reset_index()[['Election Year', 'Party']] solves your problem.
I have two similarly structured dataframes that represent two periods in time, say Jul 2020 and Aug 2020. The data in them is forecasted and/or realised revenue from several company sources, like the CRM and the accounting application. The columns contain data on clients, product, quantity, price, revenue, period, etc. Now I want to see what happened between these two months by comparing the two dataframes.
I tried to do this by renaming some of the columns, like quantity, price and revenue, and then merging the two dataframes on client, product and period. After that I calculate the difference in quantity, price and revenue.
However I run into a problem... Suppose one specific customer has closed a contract with us to purchase two specific products (abc & xyz) every month for the next two years. That means that in our July forecast we can include these two items as revenue. In reality this list is much longer, with other contracts and also expected revenue that is in the weighted pipeline.
This is a small extract from the total forecast for our specific client.
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 contracted 1 100 100
1 A xyz 2020-07 contracted 1 50 50
Now suppose this client decides to purchase a second product xyz and we get another contract for this. Then it looks like this for July:
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 contracted 1 100 100
1 A xyz 2020-07 contracted 1 50 50
2 A xyz 2020-07 contracted 1 50 50
Now suppose we are a month later, and from our accounting system we pull the realised revenue, which looks like this (so what we forecasted became reality):
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 realised 1 100 100
1 A xyz 2020-07 realised 2 50 100
And now I want to compare them by merging the two df's after renaming some of the columns.
def rename_column(df_name, col_name, first_forecast_period):
    df_name.rename(columns={col_name: col_name + '_' + first_forecast_period}, inplace=True)
    return df_name
rename_column(df_1, 'Stage', '1')
rename_column(df_1, 'Price', '1')
rename_column(df_1, 'Qty', '1')
rename_column(df_1, 'Rev', '1')
rename_column(df_2, 'Stage', '2')
rename_column(df_2, 'Price', '2')
rename_column(df_2, 'Qty', '2')
rename_column(df_2, 'Rev', '2')
result_1 = pd.merge(df_1, df_2, how ='outer')
And then some math to get the differences:
result_1['Qty_diff'] = result_1['Qty_2'] - result_1['Qty_1']
result_1['Price_diff'] = result_1['Price_2'] - result_1['Price_1']
result_1['Rev_diff'] = result_1['Rev_2'] - result_1['Rev_1']
This results in:
Client Product Period Stage_1 Qty_1 Price_1 Rev_1 Stage_2 Qty_2 Price_2 Rev_2 Qty_diff Price_diff Rev_diff
0 A abc 2020-07 contracted 1 100 100 realised 1 100 100 0 0 0
1 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
2 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
So, the problem is that in the third line the realised part is included a second time. Since the forecast and the reality are the same, the outcome should have been:
Client Product Period Stage_1 Qty_1 Price_1 Rev_1 Stage_2 Qty_2 Price_2 Rev_2 Qty_diff Price_diff Rev_diff
0 A abc 2020-07 contracted 1 100 100 realised 1 100 100 0 0 0
1 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
2 A xyz 2020-07 contracted 1 50 50 realised 0 0 0 -1 0 -50
And therefore I get a total revenue difference of 100 (+50 and +50) instead of 0 (+50 and -50). Is there any way this can be solved by merging the two DFs, or do I need to start thinking in another direction? If so, any suggestions would be helpful! Thanks.
You should probably get the totals for client-product-period on both dfs to be safe. Assuming all rows in df_1 are 'contracted', you can do:
df_1 = (df_1.groupby(['Client', 'Product', 'Period'], as_index=False)
            .agg({'Stage': 'first', 'Qty': 'sum', 'Price': 'first', 'Rev': 'sum'}))
# if price can vary between rows of the same product-client:
# .agg({'Stage': 'first', 'Qty': 'sum', 'Price': 'mean', 'Rev': 'sum'})
# same for df_2
Now you can merge both dfs with:
df_merged = df_1.merge(df_2, on=['Client', 'Product', 'Period'], how='outer')
The merge adds the suffixes _x and _y to the overlapping columns (Stage, Qty, Price, Rev) from df_1 and df_2 respectively.
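The differences can then be computed with a missing side treated as zero, so items that disappear show up as negatives (a sketch using the _x/_y suffixes from the merge above):
for col in ['Qty', 'Rev']:
    # a row missing on one side exists in only one period; count it as 0 there
    df_merged[col + '_diff'] = (df_merged[col + '_y'].fillna(0)
                                - df_merged[col + '_x'].fillna(0))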
I have a dataframe with a multilevel index (company, year) that was grouped by mean; it looks like this:
company year mean salary
ABC 2018 3000
2019 3400
LOL 2018 1200
2019 3500
I want to select the data that belongs to "LOL"; my desired outcome would be:
company year mean salary
LOL 2018 1200
2019 3500
Is there a way to select only a certain group? I tried to use the .filter function on the dataframe, but I was only able to apply it to row values (e.g. lambda x: x > 1000), not to index values.
Any advice will be appreciated!
Use DataFrame.xs with drop_level=False to avoid removing the first level:
df1 = df.xs('LOL', drop_level=False)
Or filter by the first level with Index.get_level_values:
df1 = df[df.index.get_level_values(0) == 'LOL']
print (df1)
mean salary
company year
LOL 2018 1200
2019 3500
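For completeness, passing the label as a list to .loc also keeps the first index level; a minimal, self-contained sketch rebuilding the example frame:
import pandas as pd

# rebuild the grouped example frame
df = pd.DataFrame(
    {'mean salary': [3000, 3400, 1200, 3500]},
    index=pd.MultiIndex.from_tuples(
        [('ABC', 2018), ('ABC', 2019), ('LOL', 2018), ('LOL', 2019)],
        names=['company', 'year']),
)

# list-of-labels selection keeps the 'company' level in the index
print(df.loc[['LOL']])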
Below I have an example of a DataFrame where, throughout, there will be multiple instances like the SALES TAX EXPENSE lines: whenever an account appears twice, the rows need to be collapsed into one line whose total Trans_Amt is the Trans_Type C amount minus the Trans_Type D amount.
So, for example, in this DF there should only be one line for SALES TAX EXPENSE, and its total should be -36239.65.
This happens multiple times throughout the DF, with multiple different ActNames. I'm looking for insight as to the most efficient way to do this transformation and apply it to any instance where this occurs.
Thank you!
ActName ActCode Trans_Type Trans_Amt
0 SALES 401 C 2082748.85
1 SALES TAX EXPENSE 407 C 100000.00
30 DISCOUNTS 405 D -654.59
31 SALES TAX EXPENSE 407 D 136239.65
Group the data by the account columns and assign the difference to Trans_Amt; then drop duplicates:
df['Trans_Amt'] = (df.groupby(['ActName', 'ActCode']).Trans_Amt
                     .apply(lambda x: x.diff(periods=-1))
                     .combine_first(df['Trans_Amt']))
df.drop_duplicates('ActName')
ActName ActCode Trans_Type Trans_Amt
0 SALES 401 C 2082748.85
1 SALES TAX EXPENSE 407 C -36239.65
30 DISCOUNTS 405 D -654.59
Edit: Based on the follow-up question, if the difference should be with the previous row, try:
df['Trans_Amt'] = (df.groupby(['ActName', 'ActCode']).Trans_Amt
                     .apply(lambda x: x.diff())
                     .combine_first(df['Trans_Amt']))
df.drop_duplicates('ActName', keep='last')
ActName ActCode Trans_Type Trans_Amt
0 SALES 401 C 2082748.85
30 DISCOUNTS 405 D -654.59
31 SALES TAX EXPENSE 407 D 36239.65
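Another possible angle is to pivot Trans_Type into columns, net C against D where an account has both, and keep the single side's amount otherwise; a sketch, assuming both a C and a D value appear somewhere in the data:
# one row per account, with the C and D amounts side by side
wide = df.pivot_table(index=['ActName', 'ActCode'], columns='Trans_Type',
                      values='Trans_Amt', aggfunc='sum')
both = wide['C'].notna() & wide['D'].notna()
# default to whichever single side exists...
wide['Trans_Amt'] = wide['C'].combine_first(wide['D'])
# ...and net C against D where the account appears under both types
wide.loc[both, 'Trans_Amt'] = wide.loc[both, 'C'] - wide.loc[both, 'D']
out = wide.reset_index()[['ActName', 'ActCode', 'Trans_Amt']]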