Nested Group by with count and average - python

I have this dataset:
import pandas as pd
d = {'Agreements': ["Rome", "NewYork", "Paris", "Tokyo"], 'Year': [2012, 2012, 2013, 2013],
'Provision1': [1, 1, 1, 1], 'Provision2': [1, 1, 0, 1], 'Provision3': [0, 1, 1, 0]}
df = pd.DataFrame(data=d)
  Agreements  Year  Provision1  Provision2  Provision3
0       Rome  2012           1           1           0
1    NewYork  2012           1           1           1
2      Paris  2013           1           0           1
3      Tokyo  2013           1           1           0
I would like to use groupby to obtain this output:
   Year  Count Agreements per Year  Count Provisions per Year  Average Provision per Year
   2012                          2                          5                         2.5
   2013                          2                          4                         2
I have tried with df.groupby('Year')['Agreements'].count().reset_index(name='counts') but I do not know how to expand it to obtain the output I desire. Thanks

Try:
out = df.assign(
    Provisions=df.loc[:, "Provision1":"Provision3"].sum(axis=1)
).pivot_table(
    index="Year", aggfunc={"Agreements": "count", "Provisions": ("sum", "mean")}
)
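# out.columns is now a MultiIndex of (column, aggfunc) pairs, e.g. ("Agreements", "count");
# flatten it into the requested per-year names: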
out.columns = [
    f'{b.capitalize().replace("Mean", "Average")} {a} per Year' for a, b in out.columns
]
print(out.reset_index().to_markdown(index=False))
Prints:
|   Year |   Count Agreements per Year |   Average Provisions per Year |   Sum Provisions per Year |
|-------:|----------------------------:|------------------------------:|--------------------------:|
|   2012 |                           2 |                           2.5 |                         5 |
|   2013 |                           2 |                           2   |                         4 |
EDIT: To add column with the count of "Agreements with at least one Provision":
out = df.assign(
    Provisions=df.loc[:, "Provision1":"Provision3"].sum(axis=1),
    AggreementsWithAtLeastOneProvision=df.loc[:, "Provision1":"Provision3"].any(axis=1)
).pivot_table(
    index="Year",
    aggfunc={"Agreements": "count", "AggreementsWithAtLeastOneProvision": "sum", "Provisions": ("sum", "mean")}
)
out.columns = [
    f'{b.capitalize().replace("Mean", "Average")} {a} per Year' for a, b in out.columns
]
print(out.reset_index().to_markdown(index=False))
Prints:
|   Year |   Sum AggreementsWithAtLeastOneProvision per Year |   Count Agreements per Year |   Average Provisions per Year |   Sum Provisions per Year |
|-------:|--------------------------------------------------:|----------------------------:|------------------------------:|--------------------------:|
|   2012 |                                                  2 |                           3 |                       1.66667 |                         5 |
|   2013 |                                                  2 |                           2 |                       2       |                         4 |
Input data in this case was:
Agreements Year Provision1 Provision2 Provision3
0 Bratislava 2012 0 0 0
1 Rome 2012 1 1 0
2 NewYork 2012 1 1 1
3 Paris 2013 1 0 1
4 Tokyo 2013 1 1 0

You can use melt and agg:
out = (df.assign(Average=lambda x: x.filter(like='Provision').sum(axis=1))
         .melt(id_vars=['Year', 'Agreements', 'Average'], var_name='Provision')
         .groupby('Year')
         .agg(**{'Count Agreements per Year': ('Agreements', 'nunique'),
                 'Count Provisions per Year': ('value', 'sum'),
                 'Average Provision per Year': ('Average', 'mean')})
         .reset_index())
Output:
>>> out
Year Count Agreements per Year Count Provisions per Year Average Provision per Year
0 2012 2 5 2.5
1 2013 2 4 2.0

There are multiple ways to approach this; here is a simple one:
First, combine all "Provision" columns into one column:
# combining all Provision columns into one column
df['All_Provisions'] = df['Provision1'] + df['Provision2'] + df['Provision3']
Second, aggregate the columns using .agg, which takes your columns and desired aggregations as a dictionary. In our case, we want to:
- count the "Agreements" column
- sum the "All_Provisions" column
- average the "All_Provisions" column
We can do it like this:
# aggregating columns
df = df.groupby('Year', as_index=False).agg({'Agreements':'count', 'All_Provisions':['sum', 'mean']})
Finally, rename your columns:
# renaming columns
df.columns = ['Year','Count Agreements per Year','Count Provisions per Year','Average Provision per Year']
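With the sample data from the question, this should produce:
   Year  Count Agreements per Year  Count Provisions per Year  Average Provision per Year
0  2012                          2                          5                         2.5
1  2013                          2                          4                         2.0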
Hope this helps :)

Another possible solution:
(df.iloc[:, 1:].set_index('Year').stack().to_frame()
   .pivot_table(index='Year', values=0,
                aggfunc=[lambda x: x.count() / 3, 'sum',
                         lambda x: x.sum() / (x.count() / 3)])
   .set_axis(['Count Agreements per Year', 'Count Provisions per Year',
              'Average Provision per Year'], axis=1)
   .reset_index())
Using a numpy approach:
import numpy as np

a = df.iloc[:, 1:].values
colnames = ['Year', 'Count Agreements per Year',
            'Count Provisions per Year', 'Average Provision per Year']

pd.DataFrame(
    np.vstack(
        [[x[0, 0], np.size(x[:, 1:]) / 3, np.sum(x[:, 1:]),
          np.sum(x[:, 1:]) / (np.size(x[:, 1:]) / 3)]
         for x in [a[np.where(a[:, 0] == val)] for val in np.unique(a[:, 0])]]),
    columns=colnames).convert_dtypes()
Output:
   Year  Count Agreements per Year  Count Provisions per Year  Average Provision per Year
0  2012                          2                          5                         2.5
1  2013                          2                          4                         2.0

Related

Python/increase code efficiency about multiple columns filter

I was wondering if someone could help me find a more efficient way to run my code.
I have a dataset containing 7 columns: country, sector, year, month, week, weekday, value.
The year column has only 3 values: 2019, 2020, 2021.
What I have to do here is subtract every value in 2020 and 2021 from the corresponding 2019 value.
It is more complicated in that I also need to match the weekday column.
For example, I need to take the year 2020, month 1, week 1, weekday 0 (Monday) value and subtract the year 2019, month 1, week 1, weekday 0 (Monday) value; if no match can be found, it is skipped, and so on (the weekday, Monday, Tuesday, ..., must be matched).
And here is my code. It runs, but it takes hours :(
for i in itertools.product(year_list, country_list, sector_list, month_list, week_list, weekday_list):
    try:
        data_2 = df_carbon[(df_carbon['country'] == i[1])
                           & (df_carbon['sector'] == i[2])
                           & (df_carbon['year'] == i[0])
                           & (df_carbon['month'] == i[3])
                           & (df_carbon['week'] == i[4])
                           & (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
        data_1 = df_carbon[(df_carbon['country'] == i[1])
                           & (df_carbon['sector'] == i[2])
                           & (df_carbon['year'] == 2019)
                           & (df_carbon['month'] == i[3])
                           & (df_carbon['week'] == i[4])
                           & (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
        co2.append(data_2 - data_1)
        country.append(i[1])
        sector.append(i[2])
        year.append(i[0])
        month.append(i[3])
        week.append(i[4])
        weekday.append(i[5])
    except:
        pass
I changed the nested for loops to itertools, but it is still not fast enough. Any other ideas?
Many thanks :)
##############################
Here is the sample dataset:
country co2 sector date week weekday year month
Brazil 108.767782 Power 2019-01-01 0 1 2019 1
China 14251.044482 Power 2019-01-01 0 1 2019 1
EU27 & UK 1886.493814 Power 2019-01-01 0 1 2019 1
France 53.856398 Power 2019-01-01 0 1 2019 1
Germany 378.323440 Power 2019-01-01 0 1 2019 1
Japan 21.898788 IA 2021-11-30 48 1 2021 11
Russia 19.773822 IA 2021-11-30 48 1 2021 11
Spain 42.293944 IA 2021-11-30 48 1 2021 11
UK 56.425121 IA 2021-11-30 48 1 2021 11
US 166.425000 IA 2021-11-30 48 1 2021 11
or this
import pandas as pd
pd.DataFrame({
    'year': [2019, 2020, 2021],
    'co2': [1, 2, 3],
    'country': ['Brazil', 'Brazil', 'Brazil'],
    'sector': ['power', 'power', 'power'],
    'month': [1, 1, 1],
    'week': [0, 0, 0],
    'weekday': [0, 0, 0]
})
pandas can subtract two DataFrames index-by-index, so the idea is to separate your data into a minuend and a subtrahend, set ['country', 'sector', 'month', 'week', 'weekday'] as their indices, subtract them, and remove the rows (with dropna) where no match in year 2019 is found.
df_carbon = pd.DataFrame({
    'year': [2019, 2020, 2021],
    'co2': [1, 2, 3],
    'country': ['ab', 'ab', 'bc']
})
index = ['country']
# index = ['country', 'sector', 'month', 'week', 'weekday']
df_2019 = df_carbon[df_carbon['year'] == 2019].set_index(index)
df_rest = df_carbon[df_carbon['year'] != 2019].set_index(index)
ans = (df_rest - df_2019).reset_index().dropna()
ans['year'] += 2019
Two additional points:
- In this subtraction the year is also covered, so I need to add 2019 back.
- I created a small example of df_carbon to test my code. If you had provided a more realistic version in text form, I would have tested my code using your data.
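For the full dataset, the same idea can also be written as a merge; here is a sketch, assuming the column names from the sample data above (co2_2019 and co2_diff are just illustrative names):
keys = ['country', 'sector', 'month', 'week', 'weekday']

# the 2019 baseline: one co2 value per key combination
base = (df_carbon[df_carbon['year'] == 2019]
        .rename(columns={'co2': 'co2_2019'})[keys + ['co2_2019']])

# keep only the 2020/2021 rows that have a matching 2019 observation
out = df_carbon[df_carbon['year'] != 2019].merge(base, on=keys, how='inner')
out['co2_diff'] = out['co2'] - out['co2_2019']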

Pandas: Combining data items on multiple criteria

I have a database of all customer transactions within the company I work at:
ID  Payment  Amount  Month  Year
A   Inward      100      2  2005
A   Outward     200      2  2005
B   Inward      100      7  2017
I am having difficulty combining the Sum/Count of Amount of those transactions per customer ID per Month/Year.
The only thing I have succeeded at is combining the Sum/Count of Amount per customer ID:
Combined = data.groupby("ID")["Amount"].sum().rename("Sum").reset_index()
Can you please let me know what the alternative solutions are?
Thank you in advance!
You can use a list of columns in groupby like:
>>> df.groupby(['ID', 'Year', 'Month', 'Payment'])['Amount'].agg(['sum', 'count'])
sum count
ID Year Month Payment
A 2005 2 Inward 100 1
Outward 200 1
B 2017 7 Inward 100 1
For further analysis, you can make outward payments negative first (assuming numpy is imported as np):
>>> df.assign(Amount=np.where(df['Payment'].eq('Outward'),
-df['Amount'], df['Amount'])) \
.groupby(['ID', 'Year', 'Month'])['Amount'].agg(['sum', 'count'])
sum count
ID Year Month
A 2005 2 -100 2
B 2017 7 100 1
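If you prefer named output columns (like the "Sum" in your Combined example), named aggregation does the same thing; a small sketch, assuming the same df as above:
out = (df.groupby(['ID', 'Year', 'Month'], as_index=False)
         .agg(Sum=('Amount', 'sum'), Count=('Amount', 'count')))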

how to find the number of rows in a column that are above the mean?

I have a dataset in which column A has the release year of each product and column B has the sales of each product.
I want to know how many products have sales above the mean for each year.
The dataset is a pandas DataFrame.
Thank you, and I hope my question is clear.
Compute yearly averages with groupby.transform() and compare them against the individual sales, e.g.:
df = pd.DataFrame({'product': np.random.choice(['foo','bar'], size=10), 'year': np.random.choice([2019,2020,2021], size=10), 'sales': np.random.randint(10000, size=10)})
# product year sales
# 0 foo 2019 7507
# 1 bar 2019 9186
# 2 foo 2021 6234
# 3 foo 2021 7375
# 4 bar 2020 9934
# 5 foo 2021 6403
# 6 foo 2021 7729
# 7 foo 2021 1875
# 8 bar 2020 7148
# 9 foo 2019 8163
df['above_mean'] = df.sales > df.groupby(['product','year']).sales.transform('mean')
df.groupby('year', as_index=False).above_mean.sum()
# year above_mean
# 0 2019 1
# 1 2020 1
# 2 2021 4
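If the mean is meant to be per year overall rather than per product and year, the same transform pattern works with just the year as the key; a sketch, assuming the df above:
above = df.sales > df.groupby('year').sales.transform('mean')
per_year = above.groupby(df.year).sum()   # number of rows above that year's mean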

Getting maximum counts of a column in grouped dataframe

My dataframe df is:
Election Year Votes Party Region
0 2000 50 A a
1 2000 100 B a
2 2000 26 A b
3 2000 180 B b
4 2000 300 A c
5 2000 46 C c
6 2005 149 A a
7 2005 46 B a
8 2005 312 A b
9 2005 23 B b
10 2005 16 A c
11 2005 35 C c
I want to get the Party winning the most regions every year. So the desired output is:
Election Year Party
2000 B
2005 A
I tried this code to get the above output, but it gives an error:
winner = df.groupby(['Election Year'])['Votes'].max().reset_index()
winner = winner.groupby('Election Year').first().reset_index()
winner = winner[['Election Year', 'Party']].to_string(index=False)
winner
how can I get the desired output?
Here is one approach with nested groupby. We first count per-party votes in each year-region pair, then use mode to find the party winning most regions. The mode need not be unique (if two or more parties win the same number of regions).
df.groupby(["Year", "Region"])\
.apply(lambda gp: gp.groupby("Party").Votes.sum().idxmax())\
.unstack().mode(1).rename(columns={0: "Party"})
Party
Year
2000 B
2005 A
To address the comment, you can replace idxmax above with nlargest and diff to find regions where the winning margin is below a given number.
margin = df.groupby(["Year", "Region"])\
.apply(lambda gp: gp.groupby("Party").Votes.sum().nlargest(2).diff()) > -125
print(margin[margin].reset_index()[["Year", "Region"]])
# Year Region
# 0 2000 a
# 1 2005 a
# 2 2005 c
You can use GroupBy.idxmax() to get the index of max Votes for each group of Election Year, then use .loc to locate the rows, followed by selection of the required columns, as follows:
df.loc[df.groupby('Election Year')['Votes'].idxmax()][['Election Year', 'Party']]
Result:
Election Year Party
4 2000 A
8 2005 A
Edit
If we want to get the Party winning the most Regions, we can use the following code (without using the slow .apply() with a lambda function):
(df.loc[
df.groupby(['Election Year', 'Region'])['Votes'].idxmax()]
[['Election Year', 'Party', 'Region']]
.pivot(index='Election Year', columns='Region')
.mode(axis=1)
).rename({0: 'Party'}, axis=1).reset_index()
Result:
Election Year Party
0 2000 B
1 2005 A
Try this
winner = df.groupby(['Election Year','Party'])['Votes'].max().reset_index()
winner.drop('Votes', axis = 1, inplace = True)
winner
Another method (close to @hilberts_drinking_problem's, in fact):
>>> df.groupby(["Election Year", "Region"]) \
.apply(lambda x: x.loc[x["Votes"].idxmax(), "Party"]) \
.unstack().mode(axis="columns") \
.rename(columns={0: "Party"}).reset_index()
Election Year Party
0 2000 B
1 2005 A
I believe the one-liner df.groupby(["Election Year"]).max().reset_index()[['Election Year', 'Party']] solves your problem

Comparing two dataframes without duplicates

I have two similarly structured dataframes that represent two periods in time, say Jul 2020 and Aug 2020. The data in them is forecasted and/or realised revenue data from several company sources like CRM and the accounting application. The columns contain data on clients, product, quantity, price, revenue, period, etc. Now I want to see what happened between these two months by comparing the two dataframes.
I tried to do this by renaming some of the columns like quantity, price and revenue and then merging the two dataframes on client, product and period. After that I calculate the difference on the quantity, price and revenue.
However I run into a problem... Suppose one specific customer has closed a contract with us to purchase two specific products (abc & xyz) every month for the next two years. That means that in our July forecast we can include these two items as revenue. In reality this list is much longer, with other contracts and also expected revenue that is in the weighted pipeline.
This is a small extract from the total forecast for our specific client.
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 contracted 1 100 100
1 A xyz 2020-07 contracted 1 50 50
Now suppose this client decides to purchase a second product xyz and we get another contract for this. Then it looks like this for July:
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 contracted 1 100 100
1 A xyz 2020-07 contracted 1 50 50
2 A xyz 2020-07 contracted 1 50 50
Now suppose we are a month later and from our accounting system we pull the realised revenue, which looks like this (so what we forecasted became reality):
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 realised 1 100 100
1 A xyz 2020-07 realised 2 50 100
And now I want to compare them by merging the two df's after renaming some of the columns.
def rename_column(df_name, col_name, first_forecast_period):
    df_name.rename(columns={col_name: col_name + '_' + first_forecast_period}, inplace=True)
    return df_name
rename_column(df_1, 'Stage', '1')
rename_column(df_1, 'Price', '1')
rename_column(df_1, 'Qty', '1')
rename_column(df_1, 'Rev', '1')
rename_column(df_2, 'Stage', '2')
rename_column(df_2, 'Price', '2')
rename_column(df_2, 'Qty', '2')
rename_column(df_2, 'Rev', '2')
result_1 = pd.merge(df_1, df_2, how ='outer')
And then some math to get the differences:
result_1['Qty_diff'] = result_1['Qty_2'] - result_1['Qty_1']
result_1['Price_diff'] = result_1['Price_2'] - result_1['Price_1']
result_1['Rev_diff'] = result_1['Rev_2'] - result_1['Rev_1']
This results in:
Client Product Period Stage_1 Qty_1 Price_1 Rev_1 Stage_2 Qty_2 Price_2 Rev_2 Qty_diff Price_diff Rev_diff
0 A abc 2020-07 contracted 1 100 100 realised 1 100 100 0 0 0
1 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
2 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
So, the problem is that in the third line the realised part is included a second time. Since the forecast and the reality are the same, the outcome should have been:
Client Product Period Stage_1 Qty_1 Price_1 Rev_1 Stage_2 Qty_2 Price_2 Rev_2 Qty_diff Price_diff Rev_diff
0 A abc 2020-07 contracted 1 100 100 realised 1 100 100 0 0 0
1 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
2 A xyz 2020-07 contracted 1 50 50 realised 0 0 0 -1 0 -50
And therefore I get a total revenue difference of 100 (+50 and +50), instead of 0 (+50 and -50). Is there any way this can be solved by merging the two DFs, or do I need to start thinking in another direction? If so, any suggestions would be helpful! Thanks.
You should probably get the totals for client-product-period on both dfs to be safe. Assuming all rows in df_1 are 'contracted', you can do:
df_1 = (df_1.groupby(['Client', 'Product', 'Period'])
            .agg({'Stage': 'first', 'Qty': 'sum', 'Price': 'first', 'Rev': 'sum'})
            # if price can vary between rows of the same product-client:
            # .agg({'Stage': 'first', 'Qty': 'sum', 'Price': 'mean', 'Rev': 'sum'})
            .reset_index())
# same for df_2
Now you can merge both dfs with:
df_merged = df_1.merge(df_2, on=['Client', 'Product', 'Period'])
The result will add suffixes to duplicate columns, _x and _y for df_1 and df_2 respectively.
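From there, the differences from the question can be computed on the suffixed columns; a sketch, using the default _x/_y suffixes rather than renaming to _1/_2:
df_merged['Qty_diff'] = df_merged['Qty_y'] - df_merged['Qty_x']
df_merged['Price_diff'] = df_merged['Price_y'] - df_merged['Price_x']
df_merged['Rev_diff'] = df_merged['Rev_y'] - df_merged['Rev_x']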
