Getting maximum counts of a column in grouped dataframe - python

My dataframe df is:
   Election Year  Votes Party Region
0           2000     50     A      a
1           2000    100     B      a
2           2000     26     A      b
3           2000    180     B      b
4           2000    300     A      c
5           2000     46     C      c
6           2005    149     A      a
7           2005     46     B      a
8           2005    312     A      b
9           2005     23     B      b
10          2005     16     A      c
11          2005     35     C      c
I want to get the Party winning the maximum number of regions every year. So the desired output is:
Election Year Party
2000 B
2005 A
I tried this code to get the above output, but it is giving an error:
winner = df.groupby(['Election Year'])['Votes'].max().reset_index()
winner = winner.groupby('Election Year').first().reset_index()
winner = winner[['Election Year', 'Party']].to_string(index=False)
winner
how can I get the desired output?
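For reference, the error comes from the first line: after aggregating only Votes, the Party column no longer exists, so the later column selection raises a KeyError. A minimal sketch using a few rows of the sample data:

```python
import pandas as pd

# First three rows of the question's dataframe
df = pd.DataFrame({
    'Election Year': [2000, 2000, 2005],
    'Votes': [50, 100, 149],
    'Party': ['A', 'B', 'A']})

winner = df.groupby(['Election Year'])['Votes'].max().reset_index()
print(winner.columns.tolist())  # 'Party' did not survive the aggregation
# selecting winner[['Election Year', 'Party']] therefore raises a KeyError
```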

Here is one approach with a nested groupby. We first sum per-party votes in each year-region pair to find the regional winner, then use mode to find the party winning the most regions. Note the mode need not be unique (if two or more parties win the same number of regions).
df.groupby(["Election Year", "Region"])\
  .apply(lambda gp: gp.groupby("Party").Votes.sum().idxmax())\
  .unstack().mode(1).rename(columns={0: "Party"})
              Party
Election Year
2000              B
2005              A
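As a sketch of the non-unique case: if two parties each won one region in some year, mode(axis=1) keeps both as separate columns (the regional winners below are hypothetical):

```python
import pandas as pd

# Hypothetical regional winners for one year: party A won region a, party B won region b
wins = pd.DataFrame({"a": ["A"], "b": ["B"]},
                    index=pd.Index([2000], name="Election Year"))
tie = wins.mode(axis=1)
print(tie)  # two columns, since neither party is the unique mode
```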
To address the comment, you can replace idxmax above with nlargest and diff to find regions where the win margin is below a given number.
margin = df.groupby(["Election Year", "Region"])\
    .apply(lambda gp: gp.groupby("Party").Votes.sum().nlargest(2).diff()) > -125
print(margin[margin].reset_index()[["Election Year", "Region"]])
#    Election Year Region
# 0           2000      a
# 1           2005      a
# 2           2005      c

You can use GroupBy.idxmax() to get the index of the max Votes for each group of Election Year, then use .loc to locate the rows, followed by selection of the required columns, as follows:
df.loc[df.groupby('Election Year')['Votes'].idxmax()][['Election Year', 'Party']]
Result:
Election Year Party
4 2000 A
8 2005 A
Edit
If we are to get the Party winning the most Regions, we can use the following code (without using the slow .apply() with a lambda function):
(df.loc[
     df.groupby(['Election Year', 'Region'])['Votes'].idxmax()]
 [['Election Year', 'Party', 'Region']]
 .pivot(index='Election Year', columns='Region')
 .mode(axis=1)
).rename({0: 'Party'}, axis=1).reset_index()
Result:
Election Year Party
0 2000 B
1 2005 A

Try this
winner = df.groupby(['Election Year','Party'])['Votes'].max().reset_index()
winner.drop('Votes', axis = 1, inplace = True)
winner

Another method (close to @hilberts_drinking_problem's, in fact):
>>> df.groupby(["Election Year", "Region"]) \
...     .apply(lambda x: x.loc[x["Votes"].idxmax(), "Party"]) \
...     .unstack().mode(axis="columns") \
...     .rename(columns={0: "Party"}).reset_index()
   Election Year Party
0           2000     B
1           2005     A

I believe the one-liner df.groupby(['Election Year']).max().reset_index()[['Election Year', 'Party']] solves your problem

Related

Nested Group by with count and average

I have this dataset:
import pandas as pd
d = {'Agreements': ["Rome", "NewYork", "Paris", "Tokyo"], 'Year': [2012, 2012, 2013, 2013],
     'Provision1': [1, 1, 1, 1], 'Provision2': [1, 1, 0, 1], 'Provision3': [0, 1, 1, 0]}
df = pd.DataFrame(data=d)
  Agreements  Year  Provision1  Provision2  Provision3
0       Rome  2012           1           1           0
1    NewYork  2012           1           1           1
2      Paris  2013           1           0           1
3      Tokyo  2013           1           1           0
I would like to group by to obtain as output:
Year  Count Agreements per Year  Count Provisions per Year  Average Provision per Year
2012                          2                          5                         2.5
2013                          2                          4                           2
I have tried with df.groupby('Year')['Agreements'].count().reset_index(name='counts') but I do not know how to expand it to obtain the output I desire. Thanks
Try:
out = df.assign(
    Provisions=df.loc[:, "Provision1":"Provision3"].sum(axis=1)
).pivot_table(
    index="Year", aggfunc={"Agreements": "count", "Provisions": ("sum", "mean")}
)
out.columns = [
    f'{b.capitalize().replace("Mean", "Average")} {a} per Year' for a, b in out.columns
]
print(out.reset_index().to_markdown(index=False))
Prints:
|   Year |   Count Agreements per Year |   Average Provisions per Year |   Sum Provisions per Year |
|-------:|----------------------------:|------------------------------:|--------------------------:|
|   2012 |                           2 |                           2.5 |                         5 |
|   2013 |                           2 |                           2   |                         4 |
EDIT: To add column with the count of "Agreements with at least one Provision":
out = df.assign(
    Provisions=df.loc[:, "Provision1":"Provision3"].sum(axis=1),
    AggreementsWithAtLeastOneProvision=df.loc[:, "Provision1":"Provision3"].any(axis=1)
).pivot_table(
    index="Year",
    aggfunc={"Agreements": "count", "AggreementsWithAtLeastOneProvision": "sum", "Provisions": ("sum", "mean")}
)
out.columns = [
    f'{b.capitalize().replace("Mean", "Average")} {a} per Year' for a, b in out.columns
]
print(out.reset_index().to_markdown(index=False))
Prints:
|   Year |   Sum AggreementsWithAtLeastOneProvision per Year |   Count Agreements per Year |   Average Provisions per Year |   Sum Provisions per Year |
|-------:|--------------------------------------------------:|----------------------------:|------------------------------:|--------------------------:|
|   2012 |                                                  2 |                           3 |                       1.66667 |                         5 |
|   2013 |                                                  2 |                           2 |                       2       |                         4 |
Input data in this case was:
Agreements Year Provision1 Provision2 Provision3
0 Bratislava 2012 0 0 0
1 Rome 2012 1 1 0
2 NewYork 2012 1 1 1
3 Paris 2013 1 0 1
4 Tokyo 2013 1 1 0
You can use melt and agg:
out = (df.assign(Average=lambda x: x.filter(like='Provision').sum(axis=1))
         .melt(id_vars=['Year', 'Agreements', 'Average'], var_name='Provision')
         .groupby('Year')
         .agg(**{'Count Agreements per Year': ('Agreements', 'nunique'),
                 'Count Provisions per Year': ('value', 'sum'),
                 'Average Provision per Year': ('Average', 'mean')})
         .reset_index())
Output:
>>> out
Year Count Agreements per Year Count Provisions per Year Average Provision per Year
0 2012 2 5 2.5
1 2013 2 4 2.0
There are multiple ways to approach this, here is a simple one:
First, combine all "Provision" columns into one column:
# combining all Provision columns into one column
df['All_Provisions'] = df['Provision1'] + df['Provision2'] + df['Provision3']
Second, aggregate the columns using .agg, which takes your columns and desired aggregations as a dictionary. In our case, we want to:
count "Agreements" column
sum "All_Provisions" column
average "All_Provisions" column
We can do it like:
# aggregating columns
df = df.groupby('Year', as_index=False).agg({'Agreements':'count', 'All_Provisions':['sum', 'mean']})
Finally, rename your columns:
# renaming columns
df.columns = ['Year','Count Agreements per Year','Count Provisions per Year','Average Provision per Year']
Hope this helps :)
Another possible solution:
(df.iloc[:, 1:].set_index('Year').stack().to_frame()
   .pivot_table(index='Year', values=0,
                aggfunc=[lambda x: x.count() / 3, 'sum',
                         lambda x: x.sum() / (x.count() / 3)])
   .set_axis(['Count Agreements per Year', 'Count Provisions per Year',
              'Average Provision per Year'], axis=1).reset_index())
Using a numpy approach:
a = df.iloc[:, 1:].values
colnames = ['Year', 'Count Agreements per Year',
            'Count Provisions per Year', 'Average Provision per Year']
pd.DataFrame(
    np.vstack(
        [[x[0, 0], np.size(x[:, 1:]) / 3, np.sum(x[:, 1:]),
          np.sum(x[:, 1:]) / (np.size(x[:, 1:]) / 3)]
         for x in [a[np.where(a[:, 0] == val)] for val in np.unique(a[:, 0])]]),
    columns=colnames).convert_dtypes()
Output:
Year Count Agreements per Year Count Provisions per Year \
0 2012 2 5
1 2013 2 4
Average Provision per Year
0 2.5
1 2.0

Pandas: Combining data items on multiple criteria

I have a database of all customer transactions within the company I work at.
ID  Payment  Amount  Month  Year
A   Inward      100      2  2005
A   Outward     200      2  2005
B   Inward      100      7  2017
I am struggling to combine the Sum/Count of the Amount of those transactions per customer ID per Month/Year.
The only thing I have succeeded at is combining the Sum/Count of the Amount per customer ID.
Combined = data.groupby("ID")["Amount"].sum().rename("Sum").reset_index()
Can you please let me know what are the alternative solutions?
Thank you in advance!
You can use a list of columns in groupby like:
>>> df.groupby(['ID', 'Year', 'Month', 'Payment'])['Amount'].agg(['sum', 'count'])
sum count
ID Year Month Payment
A 2005 2 Inward 100 1
Outward 200 1
B 2017 7 Inward 100 1
To go further, you can sign Outward amounts negative before aggregating:
>>> df.assign(Amount=np.where(df['Payment'].eq('Outward'),
...                           -df['Amount'], df['Amount'])) \
...   .groupby(['ID', 'Year', 'Month'])['Amount'].agg(['sum', 'count'])
sum count
ID Year Month
A 2005 2 -100 2
B 2017 7 100 1
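A further variation, assuming pandas 0.25 or later: named aggregation lets you label the sum and count columns directly. A sketch on the question's sample rows:

```python
import pandas as pd

# The question's sample transactions
data = pd.DataFrame({
    'ID': ['A', 'A', 'B'],
    'Payment': ['Inward', 'Outward', 'Inward'],
    'Amount': [100, 200, 100],
    'Month': [2, 2, 7],
    'Year': [2005, 2005, 2017]})

# Named aggregation: each keyword becomes an output column name
combined = (data.groupby(['ID', 'Year', 'Month', 'Payment'], as_index=False)
                .agg(Sum=('Amount', 'sum'), Count=('Amount', 'count')))
print(combined)
```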

Comparing two dataframes without duplicates

I have two similarly structured dataframes that represent two periods in time, say Jul 2020 and Aug 2020. The data in them is forecasted and/or realised revenue data from several company sources like CRM and accounting applications. The columns contain data on clients, product, quantity, price, revenue, period, etc. Now I want to see what happened between these two months by comparing the two dataframes.
I tried to do this by renaming some of the columns, like quantity, price and revenue, and then merging the two dataframes on client, product and period. After that I calculate the difference on the quantity, price and revenue.
However I run into a problem... Suppose one specific customer has closed a contract with us to purchase two specific products (abc & xyz) every month for the next two years. That means that in our July forecast we can include these two items as revenue. In reality this list is much longer, with other contracts and also expected revenue that is in the weighted pipeline.
This is a small extract from the total forecast for our specific client.
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 contracted 1 100 100
1 A xyz 2020-07 contracted 1 50 50
Now suppose this client decides to purchase a second product xyz and we get another contract for this. Then it looks like this for July:
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 contracted 1 100 100
1 A xyz 2020-07 contracted 1 50 50
2 A xyz 2020-07 contracted 1 50 50
Now suppose we are a month later and from our accounting system we drew the realised revenue, which looks like this (so what we forecasted became reality):
Client Product Period Stage Qty Price Rev
0 A abc 2020-07 realised 1 100 100
1 A xyz 2020-07 realised 2 50 100
And now I want to compare them by merging the two df's after renaming some of the columns.
def rename_column(df_name, col_name, first_forecast_period):
    df_name.rename(columns={col_name: col_name + '_' + first_forecast_period}, inplace=True)
    return df_name
rename_column(df_1, 'Stage', '1')
rename_column(df_1, 'Price', '1')
rename_column(df_1, 'Qty', '1')
rename_column(df_1, 'Rev', '1')
rename_column(df_2, 'Stage', '2')
rename_column(df_2, 'Price', '2')
rename_column(df_2, 'Qty', '2')
rename_column(df_2, 'Rev', '2')
result_1 = pd.merge(df_1, df_2, how ='outer')
And then some math to get the differences:
result_1['Qty_diff'] = result_1['Qty_2'] - result_1['Qty_1']
result_1['Price_diff'] = result_1['Price_2'] - result_1['Price_1']
result_1['Rev_diff'] = result_1['Rev_2'] - result_1['Rev_1']
This results in:
Client Product Period Stage_1 Qty_1 Price_1 Rev_1 Stage_2 Qty_2 Price_2 Rev_2 Qty_diff Price_diff Rev_diff
0 A abc 2020-07 contracted 1 100 100 realised 1 100 100 0 0 0
1 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
2 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
So, the problem is that in the third line the realised part is included a second time. Since the forecast and the reality are the same, the outcome should have been:
Client Product Period Stage_1 Qty_1 Price_1 Rev_1 Stage_2 Qty_2 Price_2 Rev_2 Qty_diff Price_diff Rev_diff
0 A abc 2020-07 contracted 1 100 100 realised 1 100 100 0 0 0
1 A xyz 2020-07 contracted 1 50 50 realised 2 50 100 1 0 50
2 A xyz 2020-07 contracted 1 50 50 realised 0 0 0 -1 0 -50
And therefore I get a total revenue difference of 100 (+50 and +50), instead of 0 (+50 and -50). Is there any way this can be solved by merging the two DFs, or do I need to start thinking in another direction? If so, any suggestions would be helpful! Thanks.
You should probably get the totals for client-product-period on both dfs to be safe. Assuming all rows in df_1 are 'contracted', you can do:
df_1 = (df_1.groupby(['Client', 'Product', 'Period'])
            .agg({'Stage': 'first', 'Qty': 'sum', 'Price': 'first', 'Rev': 'sum'}))
# if price can vary between rows of the same product-client pair:
# .agg({'Stage': 'first', 'Qty': 'sum', 'Price': 'mean', 'Rev': 'sum'})
# same for df_2
Now you can merge both dfs with:
df_merged = df_1.merge(df_2)
The result will add suffixes to duplicate columns, _x and _y for df_1 and df_2 respectively.
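To illustrate the suffix behaviour on a minimal, hypothetical pair of frames (column names follow the question's):

```python
import pandas as pd

# Hypothetical aggregated frames with the question's column names
df_1 = pd.DataFrame({'Client': ['A'], 'Product': ['xyz'], 'Period': ['2020-07'],
                     'Qty': [2], 'Rev': [100]})
df_2 = pd.DataFrame({'Client': ['A'], 'Product': ['xyz'], 'Period': ['2020-07'],
                     'Qty': [2], 'Rev': [100]})

# Overlapping non-key columns get _x (left) and _y (right) suffixes
merged = df_1.merge(df_2, on=['Client', 'Product', 'Period'])
print(merged.columns.tolist())
```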

Join dataframes based on partial string-match between columns

I have a dataframe whose movies I want to check for presence in another df.
after_h.sample(10, random_state=1)
movie year ratings
108 Mechanic: Resurrection 2016 4.0
206 Warcraft 2016 4.0
106 Max Steel 2016 3.5
107 Me Before You 2016 4.5
I want to check whether the above movies are present in another df.
FILM Votes
0 Avengers: Age of Ultron (2015) 4170
1 Cinderella (2015) 950
2 Ant-Man (2015) 3000
3 Do You Believe? (2015) 350
4 Max Steel (2016) 560
I want something like this as my final output:
FILM votes
0 Max Steel 560
There are two ways:
get the row-indices for partial-matches: FILM.startswith(title) or FILM.contains(title). Either of:
df1[ df1.movie.apply( lambda title: df2.FILM.str.startswith(title) ).any(1) ]
df1[ df1['movie'].apply(lambda title: df2['FILM'].str.contains(title)).any(1) ]
movie year ratings
106 Max Steel 2016 3.5
Alternatively, you can use merge() if you convert the compound string column df2['FILM'] into its two component columns movie_title (year).
# see code at bottom to recreate your dataframes
df2[['movie','year']] = df2.FILM.str.extract(r'([^\(]*) \(([0-9]*)\)')
# reorder columns and drop 'FILM' now we have its subfields 'movie','year'
df2 = df2[['movie','year','Votes']]
df2['year'] = df2['year'].astype(int)
df2.merge(df1)
movie year Votes ratings
0 Max Steel 2016 560 3.5
(Acknowledging much help from @user3483203 here and in the Python chat room)
Code to recreate dataframes:
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in pandas 1.0
dat1 = """movie year ratings
108 Mechanic: Resurrection 2016 4.0
206 Warcraft 2016 4.0
106 Max Steel 2016 3.5
107 Me Before You 2016 4.5"""
dat2 = """FILM Votes
0 Avengers: Age of Ultron (2015) 4170
1 Cinderella (2015) 950
2 Ant-Man (2015) 3000
3 Do You Believe? (2015) 350
4 Max Steel (2016) 560"""
df1 = pd.read_csv(StringIO(dat1), sep='\s{2,}', engine='python', index_col=0)
df2 = pd.read_csv(StringIO(dat2), sep='\s{2,}', engine='python')
Given input dataframes df1 and df2, you can use Boolean indexing via pd.Series.isin. To align the format of the movie strings you need to first concatenate movie and year from df1:
s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'
res = df2[df2['FILM'].isin(s)]
print(res)
FILM Votes
4 Max Steel (2016) 560
smci's option 1 is nearly there; the following worked for me:
df1['Votes'] = ''
df1['Votes']=df1['movie'].apply(lambda title: df2[df2['FILM'].str.startswith(title)]['Votes'].any(0))
Explanation:
Create a Votes column in df1
Apply a lambda to every movie string in df1
The lambda looks up df2, selecting all rows in df2 where Film starts with the movie title
Select the Votes column of the resulting subset of df2
Reduce this column with any(0), which yields a truthy result when at least one row matches (and False when there is no match)
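Note that any(0) reduces the matched Votes to a truthy value rather than returning the count itself; if the actual vote numbers are wanted, a variant of the same lookup might look like this (a sketch; lookup_votes is a helper name invented here):

```python
import pandas as pd

# Small frames in the shape of the question's data
df1 = pd.DataFrame({'movie': ['Max Steel', 'Warcraft'], 'year': [2016, 2016]})
df2 = pd.DataFrame({'FILM': ['Max Steel (2016)', 'Cinderella (2015)'],
                    'Votes': [560, 950]})

def lookup_votes(title):
    # Votes of df2 rows whose FILM starts with the movie title
    matches = df2.loc[df2['FILM'].str.startswith(title), 'Votes']
    # First matching vote count, or 0 when nothing matches
    return matches.iloc[0] if len(matches) else 0

df1['Votes'] = df1['movie'].apply(lookup_votes)
print(df1)
```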

Pandas Groupby with multiple columns selecting rows with full range of values

I am working with a pandas dataframe. From the code:
contracts.groupby(['State','Year'])['$'].mean()
I have a pandas groupby object with two group layers: State and Year.
State / Year / $
NY 2009 5
2010 10
2011 5
2012 15
NJ 2009 2
2012 12
DE 2009 1
2010 2
2011 3
2012 6
I would like to look at only those states for which I have data on all the years (i.e. NY and DE, not NJ as it is missing 2010). Is there a way to suppress those nested groups with less than full rank?
After grouping by State and Year and taking the mean,
means = contracts.groupby(['State', 'Year'])['$'].mean()
you could groupby the State alone, and use filter to keep the desired groups:
result = means.groupby(level='State').filter(lambda x: len(x)>=len(years))
For example,
import numpy as np
import pandas as pd
np.random.seed(2015)
N = 15
states = ['NY','NJ','DE']
years = range(2009, 2013)
contracts = pd.DataFrame({
    'State': np.random.choice(states, size=N),
    'Year': np.random.choice(years, size=N),
    '$': np.random.randint(10, size=N)})
means = contracts.groupby(['State', 'Year'])['$'].mean()
result = means.groupby(level='State').filter(lambda x: len(x)>=len(years))
print(result)
yields
State Year
DE 2009 8
2010 5
2011 3
2012 6
NY 2009 2
2010 1
2011 5
2012 9
Name: $, dtype: int64
Alternatively, you could filter first and then take the mean:
filtered = contracts.groupby(['State']).filter(lambda x: x['Year'].nunique() >= len(years))
result = filtered.groupby(['State', 'Year'])['$'].mean()
but playing with various examples suggests this is typically slower than taking the mean and then filtering.
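That relative cost is easy to check with timeit on the example above; a rough sketch (absolute timings will vary with data size and pandas version):

```python
import timeit
import numpy as np
import pandas as pd

np.random.seed(2015)
N = 15
states = ['NY', 'NJ', 'DE']
years = range(2009, 2013)
contracts = pd.DataFrame({
    'State': np.random.choice(states, size=N),
    'Year': np.random.choice(years, size=N),
    '$': np.random.randint(10, size=N)})

def mean_then_filter():
    means = contracts.groupby(['State', 'Year'])['$'].mean()
    return means.groupby(level='State').filter(lambda x: len(x) >= len(years))

def filter_then_mean():
    filtered = contracts.groupby('State').filter(
        lambda x: x['Year'].nunique() >= len(years))
    return filtered.groupby(['State', 'Year'])['$'].mean()

# Both orderings keep the same states and produce the same means
assert mean_then_filter().sort_index().equals(filter_then_mean().sort_index())
print('mean-then-filter:', timeit.timeit(mean_then_filter, number=100))
print('filter-then-mean:', timeit.timeit(filter_then_mean, number=100))
```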
