how to use groupby() in this case? - python

let's say: there is a data frame:
country edition sports Athletes Medals
Germany 1990 Aquatics HAJOS, Alfred silver
Germany 1990 Aquatics HIRSCHMANN, Otto silver
Germany 1990 Aquatics DRIVAS, Dimitrios silver
US 2008 Athletics MALOKINIS, Ioannis silver
US 2008 Athletics HAJOS, Alfred silver
US 2009 Athletics CHASAPIS, Spiridon gold
France 2010 Athletics CHOROPHAS, Efstathios gold
France 2010 golf HAJOS, Alfred silver
France 2011 golf ANDREOU, Joannis silver
I want to find out which edition distributed the most silver medals, so I'm trying to solve it with the groupby function this way:
df.groupby('Edition')[df['Medal']=='Silver'].count().idxmax()
but it's giving me
KeyError: 'Columns not found: False, True'
Can anyone tell me what the issue is?

So here's your pandas dataframe:
import pandas as pd
data = [
['Germany', 1990, 'Aquatics', 'HAJOS, Alfred', 'silver'],
['Germany', 1990, 'Aquatics', 'HIRSCHMANN, Otto', 'silver'],
['Germany', 1990, 'Aquatics', 'DRIVAS, Dimitrios', 'silver'],
['US', 2008, 'Athletics', 'MALOKINIS, Ioannis', 'silver'],
['US', 2008, 'Athletics', 'HAJOS, Alfred', 'silver'],
['US', 2009, 'Athletics', 'CHASAPIS, Spiridon', 'gold'],
['France', 2010, 'Athletics', 'CHOROPHAS, Efstathios', 'gold'],
['France', 2010, 'golf', 'HAJOS, Alfred', 'silver'],
['France', 2011, 'golf', 'ANDREOU, Joannis', 'silver']
]
df = pd.DataFrame(data, columns = ['country', 'edition', 'sports', 'Athletes', 'Medals'])
print(df)
country edition sports Athletes Medals
0 Germany 1990 Aquatics HAJOS, Alfred silver
1 Germany 1990 Aquatics HIRSCHMANN, Otto silver
2 Germany 1990 Aquatics DRIVAS, Dimitrios silver
3 US 2008 Athletics MALOKINIS, Ioannis silver
4 US 2008 Athletics HAJOS, Alfred silver
5 US 2009 Athletics CHASAPIS, Spiridon gold
6 France 2010 Athletics CHOROPHAS, Efstathios gold
7 France 2010 golf HAJOS, Alfred silver
8 France 2011 golf ANDREOU, Joannis silver
Now you can simply filter for the silver medals first, then group by edition, and finally take the count. Note the two problems with the original attempt: indexing the groupby with a boolean Series asks for columns literally named True and False (hence the "Columns not found: False, True" error), and the labels in the frame are 'edition' and 'silver', not 'Edition' and 'Silver':
df[df.Medals == 'silver'].groupby('edition').count()['Medals'].idxmax()
>>> 1990
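The same filter-then-count logic can also be written with value_counts, which sorts descending so the top edition comes out first; a sketch reusing the toy frame's column names:

```python
import pandas as pd

# Toy frame mirroring the one above (lowercase 'edition', 'Medals')
df = pd.DataFrame({
    'edition': [1990, 1990, 1990, 2008, 2008, 2009, 2010, 2010, 2011],
    'Medals': ['silver', 'silver', 'silver', 'silver', 'silver',
               'gold', 'gold', 'silver', 'silver'],
})

# value_counts on the filtered column counts silvers per edition,
# so idxmax picks the edition with the most
top_edition = df.loc[df['Medals'] == 'silver', 'edition'].value_counts().idxmax()
print(top_edition)  # 1990
```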

You can group by both columns to solve it:
df[df['Medals'] == 'silver'].groupby(['edition','Medals'],as_index=True)['Athletes'].count().idxmax()
# Outcome:
(1990, 'silver')

df[df['Medals']=='silver'].groupby('edition').size().idxmax()
I tried this and it worked! I just replaced count() with size(). (Note the column is 'Medals', not 'Medal', in the sample frame.)
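The difference matters when there are missing values: size() counts every row in the group, while count() counts only non-null entries. A small sketch with hypothetical data containing a NaN:

```python
import pandas as pd

df = pd.DataFrame({
    'edition': [1990, 1990, 2008],
    'Medals': ['silver', None, 'silver'],
})

# size() counts every row in the group; count() skips the NaN
by_size = df.groupby('edition').size()
by_count = df.groupby('edition')['Medals'].count()

print(by_size[1990], by_count[1990])  # 2 1
```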

You should count per edition per medal:
>>> df = pd.DataFrame({'edition':[1990,1990,1990,2008,2008,2009,2010,2010,2011],'Medals':['silver','silver','silver','silver','silver','gold','gold','silver','silver']})
>>> df['count'] = df.groupby(['edition','Medals'])['Medals'].transform('count')
Then do the filtering on max():
>>> df = df[df['Medals'].isin(['silver'])]
>>> df
edition Medals count
0 1990 silver 3
1 1990 silver 3
2 1990 silver 3
3 2008 silver 2
4 2008 silver 2
7 2010 silver 1
8 2011 silver 1
>>> df = df[df['count'].isin([df['count'].max()])]
>>> df
edition Medals count
0 1990 silver 3
1 1990 silver 3
2 1990 silver 3
or
>>> df[df['count'].isin([df['count'].max()])]['edition'].unique()[0]
1990
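For reference, transform('size') broadcasts the per-group row count back to every row directly, so no placeholder column is needed; a sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'edition': [1990, 1990, 1990, 2008, 2008, 2009, 2010, 2010, 2011],
    'Medals': ['silver', 'silver', 'silver', 'silver', 'silver',
               'gold', 'gold', 'silver', 'silver'],
})

# transform('size') returns the size of each row's (edition, Medals) group
df['count'] = df.groupby(['edition', 'Medals'])['Medals'].transform('size')

silver = df[df['Medals'] == 'silver']
best = silver.loc[silver['count'].idxmax(), 'edition']
print(best)  # 1990
```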

Replace values from a dataframe with values from another with Pandas

I have two dataframes with identical columns, but different values and different number of rows.
import pandas as pd
data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
'Price': [500, 400, 0,450,750,0,0,890,500,470,0,415]}
data2 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017,2016, 2017],
'Price': [200, 100, 30,750,350,120,400,370]}
df = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df is the complete dataset, but with some old values, whereas df2 only has the updated values. I want to replace the values in df with those from df2, while keeping the values from df that aren't in df2.
So, for example, in df the Price for Country = Japan, Product = DEF, Year = 2016 should be updated from 500 to 400, and likewise for 2017 (470 to 370), while 2018 and 2019 stay the same.
So far I have the following code that doesn't seem to work:
common_index = ['Region','Country','Product','Year']
df = df.set_index(common_index)
df2 = df2.set_index(common_index)
df.update(df2, overwrite = True)
But this only updates df with the values from df2 and deletes everything else.
Expected output should look like this:
data3 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
'Price': [200, 100, 30,750,350,120,0,890,400,370,0,415]}
df3 = pd.DataFrame(data3)
Any suggestions on how I can do this?
You can use merge and update:
df.update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'],
how='left', suffixes=('_old', None)))
NB. the update is in place.
output:
Region Country Product Year Price
0 Africa South Africa ABC 2016 200.0
1 Africa South Africa ABC 2017 100.0
2 Africa South Africa ABC 2018 30.0
3 Africa South Africa ABC 2019 750.0
4 Africa South Africa XYZ 2016 350.0
5 Africa South Africa XYZ 2017 120.0
6 Africa South Africa XYZ 2018 0.0
7 Africa South Africa XYZ 2019 890.0
8 Asia Japan DEF 2016 400.0
9 Asia Japan DEF 2017 370.0
10 Asia Japan DEF 2018 0.0
11 Asia Japan DEF 2019 415.0
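A minimal, self-contained sketch of that merge-plus-update pattern, with toy Year/Price frames and made-up values:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2016, 2017, 2018], 'Price': [500, 470, 0]})
df2 = pd.DataFrame({'Year': [2016, 2017], 'Price': [400, 370]})

# Left-merge keeps every row of df; suffixes=('_old', None) leaves df2's
# price named 'Price' so update() can align on that column
merged = df.merge(df2, on='Year', how='left', suffixes=('_old', None))
df.update(merged)  # in place; rows with NaN (2018) are left untouched

print(df['Price'].tolist())
```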
You can also update just the Price column:
df['Price'].update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'], how='left')['Price_y'])
print(df)
Region Country Product Year Price
0 Africa South Africa ABC 2016 200
1 Africa South Africa ABC 2017 100
2 Africa South Africa ABC 2018 30
3 Africa South Africa ABC 2019 750
4 Africa South Africa XYZ 2016 350
5 Africa South Africa XYZ 2017 120
6 Africa South Africa XYZ 2018 0
7 Africa South Africa XYZ 2019 890
8 Asia Japan DEF 2016 400
9 Asia Japan DEF 2017 370
10 Asia Japan DEF 2018 0
11 Asia Japan DEF 2019 415
I don't know if this is the case, but what if df2 carries something not listed in df1? Here I'm adding a row to df2 with the data Asia, Japan, DEF, 2020, 400.
import pandas as pd
import numpy as np
data1 = {
'Region': ['Africa','Africa','Africa','Africa',
'Africa','Africa','Africa','Africa',
'Asia','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa',
'South Africa','South Africa','South Africa',
'South Africa','South Africa','South Africa',
'Japan','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ',
'XYZ','DEF','DEF','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018,
2019,2016, 2017, 2018, 2019],
'Price': [500, 400, 0,450,750,0,0,890,500,
470,0,415]}
data2 = {
'Region': ['Africa','Africa','Africa','Africa','Africa',
'Africa','Asia','Asia', 'Asia'],
'Country': ['South Africa','South Africa','South Africa',
'South Africa','South Africa',
'South Africa','Japan','Japan', 'Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF',
'DEF', 'DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017,2016, 2017, 2020],
'Price': [200, 100, 30,750,350,120,400,370, 400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
Here I call df1 the first dataframe instead of df. Then I'm adding a few steps so we know exactly what is going on.
First I rename Price to Price_new in df2, then I do an outer join between the two dataframes.
df2 = df2.rename(columns={"Price": "Price_new"})
cols_merge = ['Region', 'Country', 'Product', 'Year']
df = pd.merge(df1, df2, how="outer", on=cols_merge)
which gives
Region Country Product Year Price Price_new
0 Africa South Africa ABC 2016 500.0 200.0
1 Africa South Africa ABC 2017 400.0 100.0
2 Africa South Africa ABC 2018 0.0 30.0
3 Africa South Africa ABC 2019 450.0 750.0
4 Africa South Africa XYZ 2016 750.0 350.0
5 Africa South Africa XYZ 2017 0.0 120.0
6 Africa South Africa XYZ 2018 0.0 NaN
7 Africa South Africa XYZ 2019 890.0 NaN
8 Asia Japan DEF 2016 500.0 400.0
9 Asia Japan DEF 2017 470.0 370.0
10 Asia Japan DEF 2018 0.0 NaN
11 Asia Japan DEF 2019 415.0 NaN
12 Asia Japan DEF 2020 NaN 400.0
Now, wherever Price_new is not null, we update the Price column:
df["Price"] = np.where(
df["Price_new"].notnull(),
df["Price_new"],
df["Price"])
The output being
Region Country Product Year Price Price_new
0 Africa South Africa ABC 2016 200.0 200.0
1 Africa South Africa ABC 2017 100.0 100.0
2 Africa South Africa ABC 2018 30.0 30.0
3 Africa South Africa ABC 2019 750.0 750.0
4 Africa South Africa XYZ 2016 350.0 350.0
5 Africa South Africa XYZ 2017 120.0 120.0
6 Africa South Africa XYZ 2018 0.0 NaN
7 Africa South Africa XYZ 2019 890.0 NaN
8 Asia Japan DEF 2016 400.0 400.0
9 Asia Japan DEF 2017 370.0 370.0
10 Asia Japan DEF 2018 0.0 NaN
11 Asia Japan DEF 2019 415.0 NaN
12 Asia Japan DEF 2020 400.0 400.0
And you can eventually remove the extra column with
df = df.drop(columns=["Price_new"])
Note: the other solutions are great and I upvoted them. I added this to show that sometimes less specific code gives you better control and maintainability.
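In the same spirit, combine_first expresses the keep-new-else-old logic, including rows that exist on only one side, without an explicit np.where. A sketch on reduced frames (not code from the answers above):

```python
import pandas as pd

df1 = pd.DataFrame({'Year': [2016, 2017, 2018], 'Price': [500, 470, 0]})
df2 = pd.DataFrame({'Year': [2016, 2017, 2020], 'Price': [400, 370, 400]})

# combine_first takes df2's value where present and falls back to df1,
# keeping keys that appear in only one frame (like the outer join above)
out = (df2.set_index('Year')
          .combine_first(df1.set_index('Year'))
          .reset_index())
print(out)
```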

How to match identical columns' fields from different DataFrames in Python?

I need to match the identical fields of two columns from two separate dataframes and rewrite the original dataframe based on the other one.
So I have this original df:
Original Car Brand Original City
0 Daimler Chicago
1 Mitsubishi LA
2 Tesla Vienna
3 Toyota Zurich
4 Renault Sydney
5 Ford Toronto
6 BMW Hamburg
7 Audi Sport Helsinki
8 Citroen Dublin
9 Chevrolet Brisbane
10 Fiat San Francisco
11 Audi New York City
12 Ferrari Oslo
13 Volkswagen Stockholm
14 Lamborghini Singapore
15 Mercedes Lisbon
16 Jaguar Boston
And this new df:
Car Brand Current City
0 Tesla Amsterdam
1 Renault Paris
2 BMW Munich
3 Fiat Detroit
4 Audi Berlin
5 Ferrari Bruxelles
6 Lamborghini Rome
7 Mercedes Madrid
I need to match the car brands that are identical within the above two dataframes and write the new associate city in the original df, so the result should be this one: (so for example Tesla is now Amsterdam instead of Vienna)
Original Car Brand Original City
0 Daimler Chicago
1 Mitsubishi LA
2 Tesla Amsterdam
3 Toyota Zurich
4 Renault Paris
5 Ford Toronto
6 BMW Munich
7 Audi Sport Helsinki
8 Citroen Dublin
9 Chevrolet Brisbane
10 Fiat Detroit
11 Audi Berlin
12 Ferrari Bruxelles
13 Volkswagen Stockholm
14 Lamborghini Rome
15 Mercedes Madrid
16 Jaguar Boston
I tried with this code for mapping the columns and rewrite the field, but it doesn't really work and I cannot figure out how to make it work:
original_df['Original City'] = original_df['Car Brand'].map(dict(corrected_df[['Car Brand', 'Current City']]))
How can I make it work? Thanks a lot!
P.S.: Code for df:
cars = ['Daimler', 'Mitsubishi','Tesla', 'Toyota', 'Renault', 'Ford','BMW', 'Audi Sport','Citroen', 'Chevrolet', 'Fiat', 'Audi', 'Ferrari', 'Volkswagen','Lamborghini', 'Mercedes', 'Jaguar']
cities = ['Chicago', 'LA', 'Vienna', 'Zurich', 'Sydney', 'Toronto', 'Hamburg', 'Helsinki', 'Dublin', 'Brisbane', 'San Francisco', 'New York City', 'Oslo', 'Stockholm', 'Singapore', 'Lisbon', 'Boston']
data = {'Original Car Brand': cars, 'Original City': cities}
original_df = pd.DataFrame(data, columns=['Original Car Brand', 'Original City'])
---
cars = ['Tesla', 'Renault', 'BMW', 'Fiat', 'Audi', 'Ferrari', 'Lamborghini', 'Mercedes']
cities = ['Amsterdam', 'Paris', 'Munich', 'Detroit', 'Berlin', 'Bruxelles', 'Rome', 'Madrid']
data = {'Car Brand': cars, 'Current City': cities}
corrected_df = pd.DataFrame(data, columns=['Car Brand', 'Current City'])
Use Series.map, then replace the non-matched values with the original column via Series.fillna:
s = corrected_df.set_index('Car Brand')['Current City']
original_df['Original City'] = (original_df['Original Car Brand'].map(s)
.fillna(original_df['Original City']))
print (original_df)
Original Car Brand Original City
0 Daimler Chicago
1 Mitsubishi LA
2 Tesla Amsterdam
3 Toyota Zurich
4 Renault Paris
5 Ford Toronto
6 BMW Munich
7 Audi Sport Helsinki
8 Citroen Dublin
9 Chevrolet Brisbane
10 Fiat Detroit
11 Audi Berlin
12 Ferrari Bruxelles
13 Volkswagen Stockholm
14 Lamborghini Rome
15 Mercedes Madrid
16 Jaguar Boston
Your solution would work if you convert both columns to a NumPy array before building the dict:
d = dict(corrected_df[['Car Brand','Current City']].to_numpy())
original_df['Original City'] = (original_df['Original Car Brand'].map(d)
.fillna(original_df['Original City']))
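A self-contained sketch of the map-plus-fillna pattern on a reduced version of the frames above:

```python
import pandas as pd

original_df = pd.DataFrame({
    'Original Car Brand': ['Daimler', 'Tesla', 'Renault'],
    'Original City': ['Chicago', 'Vienna', 'Sydney'],
})
corrected_df = pd.DataFrame({
    'Car Brand': ['Tesla', 'Renault'],
    'Current City': ['Amsterdam', 'Paris'],
})

# Brand -> new city lookup; map leaves non-matching brands as NaN,
# which fillna then replaces with the old city
s = corrected_df.set_index('Car Brand')['Current City']
original_df['Original City'] = (original_df['Original Car Brand'].map(s)
                                .fillna(original_df['Original City']))
print(original_df['Original City'].tolist())  # ['Chicago', 'Amsterdam', 'Paris']
```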
You can use the set_index() and assign() methods:
resultdf=original_df.set_index('Original Car Brand').assign(OriginalCity=corrected_df.set_index('Car Brand'))
Finally, use fillna() and reset_index():
resultdf=resultdf['OriginalCity'].fillna(resultdf['Original City']).reset_index()
Let us try update
df1 = df1.set_index('Original Car Brand')
df1.update(df2.set_index('Car Brand'))
df1 = df1.reset_index()
Merge can do the work as well:
original_df['Original City'] = (original_df
    .merge(corrected_df, left_on='Original Car Brand',
           right_on='Car Brand', how='left')['Current City']
    .fillna(original_df['Original City']))

how to compare the sum score of men to sum score of women to get the count of countries?

Let's say this is my data frame:
country Edition sports Athletes Medal Gender Score
Germany 1990 Aquatics HAJOS, Alfred gold M 3
Germany 1990 Aquatics HIRSCHMANN, Otto silver M 2
Germany 1990 Aquatics DRIVAS, Dimitrios gold W 3
Germany 1990 Aquatics DRIVAS, Dimitrios silver W 2
US 2008 Athletics MALOKINIS, Ioannis gold M 1
US 2008 Athletics HAJOS, Alfred silver M 2
US 2009 Athletics CHASAPIS, Spiridon gold W 3
France 2010 Athletics CHOROPHAS, Efstathios gold W 3
France 2010 Athletics CHOROPHAS, Efstathios gold M 3
France 2010 golf HAJOS, Alfred Bronze M 1
France 2011 golf ANDREOU, Joannis silver W 2
Spain 2011 golf BURKE, Thomas gold M 3
I am trying to find out for how many countries the sum of the men's scores is equal to the sum of the women's scores.
I have tried the following:
sum_men = df[df['Gender']=='M'].groupby('country')['Score'].sum()
sum_women = df[df['Gender']=='W'].groupby('country')['Score'].sum()
Now I don't know how to compare these two and count the countries where the sum of the men's scores equals the sum of the women's scores.
Can anyone please help me with this?
You can do this:
sum_men = df[df['Gender']=='M'].groupby('country')['Score'].sum().reset_index()  # watch the reset_index()
sum_women = df[df['Gender']=='W'].groupby('country')['Score'].sum().reset_index()
new_df = sum_men.merge(sum_women, on="country")
new_df['diff'] = new_df['Score_x'] - new_df['Score_y']
new_df
country Score_x Score_y diff
0 France 4 5 -1
1 Germany 5 5 0
2 US 3 3 0
print(new_df[new_df['diff']==0])
country Score_x Score_y diff
1 Germany 5 5 0
2 US 3 3 0
Not sure if you want to keep the ones that are equal or the others, but the same logic applies either way:
group = df.groupby(['country', 'Gender'])['Score'].sum().unstack()
not_equal = group[group.M != group.W]
filtered_df = df[df.country.isin(not_equal.index)]
Output:
country Edition sports Athletes Medal Gender Score
7 France 2010 Athletics CHOROPHAS, Efstathios gold W 3
8 France 2010 Athletics CHOROPHAS, Efstathios gold M 3
9 France 2010 golf HAJOS, Alfred Bronze M 1
10 France 2011 golf ANDREOU, Joannis silver W 2
11 Spain 2011 golf BURKE, Thomas gold M 3
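If you only need the number the question asks for (how many countries have equal sums), you can compare the unstacked columns directly; a sketch on a reduced frame:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Germany', 'Germany', 'Germany', 'Germany', 'US', 'US', 'US',
                'France', 'France', 'France', 'France', 'Spain'],
    'Gender': ['M', 'M', 'W', 'W', 'M', 'M', 'W', 'W', 'M', 'M', 'W', 'M'],
    'Score': [3, 2, 3, 2, 1, 2, 3, 3, 3, 1, 2, 3],
})

# Sum per (country, Gender), pivot genders to columns, then compare.
# Countries missing a gender give NaN, which never compares equal.
sums = df.groupby(['country', 'Gender'])['Score'].sum().unstack()
n_equal = int((sums['M'] == sums['W']).sum())
print(n_equal)  # Germany (5 vs 5) and US (3 vs 3) -> 2
```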

how to assign Firstname of Athletes to new column 'Firstname' in most efficient way? [duplicate]

This question already has answers here:
How to split a dataframe string column into two columns?
(11 answers)
Closed 3 years ago.
Let's say: there is a data frame:
country Edition sports Athletes Medal Firstname
Germany 1990 Aquatics HAJOS, Alfred gold Alfred
Germany 1990 Aquatics HIRSCHMANN, Otto silver Otto
Germany 1990 Aquatics DRIVAS, Dimitrios silver Dimitrios
US 2008 Athletics MALOKINIS, Ioannis gold Ioannis
US 2008 Athletics HAJOS, Alfred silver Alfred
US 2009 Athletics CHASAPIS, Spiridon gold Spiridon
France 2010 Athletics CHOROPHAS, Efstathios gold Efstathios
France 2010 Athletics CHOROPHAS, Efstathios gold Efstathios
France 2010 golf HAJOS, Alfred Bronze Alfred
France 2011 golf ANDREOU, Joannis silver Joannis
Spain 2011 golf BURKE, Thomas gold Thomas
Can anyone tell me how to assign the first name of each athlete to the new column 'Firstname' in the most efficient way, after adding the column with df['Firstname']?
For your case:
df['Firstname'] = df['Athletes'].str.split(r',\s+').str[1]
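A quick runnable sketch of that one-liner:

```python
import pandas as pd

df = pd.DataFrame({'Athletes': ['HAJOS, Alfred', 'HIRSCHMANN, Otto']})

# Split on the comma plus any following whitespace and take the second
# piece; the vectorised .str accessor avoids a Python-level apply
df['Firstname'] = df['Athletes'].str.split(r',\s+').str[1]
print(df['Firstname'].tolist())  # ['Alfred', 'Otto']
```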

How to use df[df['Event_gender']== 'X'] correctly in this case?

Let's say: there is a data frame:
country Edition sports Athletes Medal Event_gender
Germany 1990 Aquatics HAJOS, Alfred gold X
Germany 1990 Aquatics HIRSCHMANN, Otto silver X
Germany 1990 Aquatics DRIVAS, Dimitrios silver M
US 2008 Athletics MALOKINIS, Ioannis gold M
US 2008 Athletics HAJOS, Alfred silver W
US 2009 Athletics CHASAPIS, Spiridon gold X
France 2010 Athletics CHOROPHAS, Efstathios gold X
France 2010 Athletics CHOROPHAS, Efstathios gold M
France 2010 golf HAJOS, Alfred silver M
France 2011 golf ANDREOU, Joannis silver W
Spain 2011 golf BURKE, Thomas gold W
I want to find out how many countries have won a gold medal in an event with gender equal to 'X'.
So I'm trying to solve it, but I am stuck. I did:
df[df['Medal']== 'gold']['country'].nunique()
and now I have the count of the countries that have won a gold medal, but I am struggling to combine it with
df[df['Event_gender']== 'X']
to get the final result. Can anyone help me out with this?
IIUC, it is
df.loc[(df['Medal']=='gold') & (df['Event_gender']=='X'), 'country'].nunique()
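A runnable sketch of the combined mask on a reduced frame:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Germany', 'US', 'US', 'France', 'Spain'],
    'Medal': ['gold', 'gold', 'silver', 'gold', 'gold'],
    'Event_gender': ['X', 'M', 'W', 'X', 'W'],
})

# Combine both conditions with &; each comparison needs its own
# parentheses because & binds tighter than ==
mask = (df['Medal'] == 'gold') & (df['Event_gender'] == 'X')
n_countries = df.loc[mask, 'country'].nunique()
print(n_countries)  # Germany and France -> 2
```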
