I have two dataframes with identical columns but different values and a different number of rows.
import pandas as pd
data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
         'Year': [2016, 2017, 2018, 2019, 2016, 2017, 2018, 2019, 2016, 2017, 2018, 2019],
         'Price': [500, 400, 0, 450, 750, 0, 0, 890, 500, 470, 0, 415]}
data2 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF','DEF'],
         'Year': [2016, 2017, 2018, 2019, 2016, 2017, 2016, 2017],
         'Price': [200, 100, 30, 750, 350, 120, 400, 370]}
df = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df is the complete dataset but with some old values, whereas df2 only has the updated values. I want to replace the values in df with the values from df2 wherever they overlap, while keeping the values from df that aren't in df2.
So for example, in df, for Country = Japan and Product = DEF, the Price for Year = 2016 should be updated from 500 to 400, and for 2017 from 470 to 370, while 2018 and 2019 stay the same.
So far I have the following code that doesn't seem to work:
common_index = ['Region','Country','Product','Year']
df = df.set_index(common_index)
df2 = df2.set_index(common_index)
df.update(df2, overwrite = True)
But this only updates df with the values from df2 and deletes everything else.
Expected output should look like this:
data3 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
         'Year': [2016, 2017, 2018, 2019, 2016, 2017, 2018, 2019, 2016, 2017, 2018, 2019],
         'Price': [200, 100, 30, 750, 350, 120, 0, 890, 400, 370, 0, 415]}
df3 = pd.DataFrame(data3)
Any suggestions on how I can do this?
You can use merge and update:
df.update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'],
how='left', suffixes=('_old', None)))
NB. the update is in place.
output:
Region Country Product Year Price
0 Africa South Africa ABC 2016 200.0
1 Africa South Africa ABC 2017 100.0
2 Africa South Africa ABC 2018 30.0
3 Africa South Africa ABC 2019 750.0
4 Africa South Africa XYZ 2016 350.0
5 Africa South Africa XYZ 2017 120.0
6 Africa South Africa XYZ 2018 0.0
7 Africa South Africa XYZ 2019 890.0
8 Asia Japan DEF 2016 400.0
9 Asia Japan DEF 2017 370.0
10 Asia Japan DEF 2018 0.0
11 Asia Japan DEF 2019 415.0
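Incidentally, your original set_index attempt is workable too: update aligns on the shared index and leaves rows absent from df2 untouched, so it does not delete anything. A minimal sketch, assuming the frames as posted (before any prior set_index):
import pandas as pd
common_index = ['Region', 'Country', 'Product', 'Year']
df = df.set_index(common_index)
df.update(df2.set_index(common_index))  # in place; keys missing from df2 keep their old Price
df = df.reset_index()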
You can use merge with the default suffixes (the overlapping Price columns become Price_x from df and Price_y from df2) and update only the Price column:
df['Price'].update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'], how='left')['Price_y'])
print(df)
Region Country Product Year Price
0 Africa South Africa ABC 2016 200
1 Africa South Africa ABC 2017 100
2 Africa South Africa ABC 2018 30
3 Africa South Africa ABC 2019 750
4 Africa South Africa XYZ 2016 350
5 Africa South Africa XYZ 2017 120
6 Africa South Africa XYZ 2018 0
7 Africa South Africa XYZ 2019 890
8 Asia Japan DEF 2016 400
9 Asia Japan DEF 2017 370
10 Asia Japan DEF 2018 0
11 Asia Japan DEF 2019 415
I don't know if this is the case, but what if df2 carries something not listed in df1? Here I'm adding a row to df2 with the data Asia, Japan, DEF, 2020, 400.
import pandas as pd
import numpy as np
data1 = {
'Region': ['Africa','Africa','Africa','Africa',
'Africa','Africa','Africa','Africa',
'Asia','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa',
'South Africa','South Africa','South Africa',
'South Africa','South Africa','South Africa',
'Japan','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ',
'XYZ','DEF','DEF','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019, 2016, 2017, 2018,
         2019, 2016, 2017, 2018, 2019],
'Price': [500, 400, 0, 450, 750, 0, 0, 890, 500,
          470, 0, 415]}
data2 = {
'Region': ['Africa','Africa','Africa','Africa','Africa',
'Africa','Asia','Asia', 'Asia'],
'Country': ['South Africa','South Africa','South Africa',
'South Africa','South Africa',
'South Africa','Japan','Japan', 'Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF',
'DEF', 'DEF'],
'Year': [2016, 2017, 2018, 2019, 2016, 2017, 2016, 2017, 2020],
'Price': [200, 100, 30, 750, 350, 120, 400, 370, 400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
Here I call the first dataframe df1 instead of df. Then I'm adding a few steps so we know exactly what is going on.
First I rename Price to Price_new in df2, then I do an outer join between the two dataframes.
df2 = df2.rename(columns={"Price": "Price_new"})
cols_merge = ['Region', 'Country', 'Product', 'Year']
df = pd.merge(df1, df2, how="outer", on=cols_merge)
which gives
Region Country Product Year Price Price_new
0 Africa South Africa ABC 2016 500.0 200.0
1 Africa South Africa ABC 2017 400.0 100.0
2 Africa South Africa ABC 2018 0.0 30.0
3 Africa South Africa ABC 2019 450.0 750.0
4 Africa South Africa XYZ 2016 750.0 350.0
5 Africa South Africa XYZ 2017 0.0 120.0
6 Africa South Africa XYZ 2018 0.0 NaN
7 Africa South Africa XYZ 2019 890.0 NaN
8 Asia Japan DEF 2016 500.0 400.0
9 Asia Japan DEF 2017 470.0 370.0
10 Asia Japan DEF 2018 0.0 NaN
11 Asia Japan DEF 2019 415.0 NaN
12 Asia Japan DEF 2020 NaN 400.0
Now, wherever Price_new is not null, we update the Price column:
df["Price"] = np.where(
df["Price_new"].notnull(),
df["Price_new"],
df["Price"])
The output being
Region Country Product Year Price Price_new
0 Africa South Africa ABC 2016 200.0 200.0
1 Africa South Africa ABC 2017 100.0 100.0
2 Africa South Africa ABC 2018 30.0 30.0
3 Africa South Africa ABC 2019 750.0 750.0
4 Africa South Africa XYZ 2016 350.0 350.0
5 Africa South Africa XYZ 2017 120.0 120.0
6 Africa South Africa XYZ 2018 0.0 NaN
7 Africa South Africa XYZ 2019 890.0 NaN
8 Asia Japan DEF 2016 400.0 400.0
9 Asia Japan DEF 2017 370.0 370.0
10 Asia Japan DEF 2018 0.0 NaN
11 Asia Japan DEF 2019 415.0 NaN
12 Asia Japan DEF 2020 400.0 400.0
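As a side note, the np.where step above has a shorter pandas equivalent, if you prefer to stay within pandas:
# take Price_new where present, otherwise keep the original Price
df["Price"] = df["Price_new"].fillna(df["Price"])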
And you can optionally remove the extra column with
df = df.drop(columns=["Price_new"])
Note
The other solutions are great and I upvoted them. I added this to show you that sometimes it is better to use less specific code, in order to have better control and maintainability.
So I have two dataframes:
gp_df = CCA3 Country/Territory year GDP_USD
2662 AFG Afghanistan 1970 1.748887e+09
2661 AFE Africa Eastern and Southern 1970 4.486261e+10
2663 AFW Africa Western and Central 1970 2.350461e+10
2665 ALB Albania 1970 NaN
2720 DZA Algeria 1970 4.863487e+09
... ... ... ... ...
16156 PSE West Bank and Gaza 2020 1.553170e+10
16219 WLD World 2020 8.490680e+13
16222 YEM Yemen, Rep. 2020 1.884051e+10
16224 ZMB Zambia 2020 1.811063e+10
16225 ZWE Zimbabwe 2020 1.805117e+10
pp_df = CCA3 Country/Territory Continent 2020 Population 1970 Population
0 AFG Afghanistan Asia 38972230 10752971
1 ALB Albania Europe 2866849 2324731
2 DZA Algeria Africa 43451666 13795915
3 ASM American Samoa Oceania 46189 27075
4 AND Andorra Europe 77700 19860
.. ... ... ... ... ...
229 WLF Wallis and Futuna Oceania 11655 9377
230 ESH Western Sahara Africa 556048 76371
231 YEM Yemen Asia 32284046 6843607
232 ZMB Zambia Africa 18927715 4281671
233 ZWE Zimbabwe Africa 15669666 5202918
I want to find a way to remove all the Countries/Territories from gp_df that aren't in pp_df. I've already tried using .drop() and np.where() to try to locate the duplicates and then drop them, but I can't seem to get the syntax correct.
Does this work...
gp_df[gp_df['Country/Territory'].isin(pp_df['Country/Territory'])]
(no ~ in front of the condition: you want to keep the rows whose Country/Territory is in pp_df and drop the rest)
Dataframes to work on
gp_df = pd.DataFrame({'CCA3': ["AFG" , "AFE", "AFW", "ALB", "DZA"],
'Country/Territory': ["Afghanistan" , "Africa Eastern and Southern",
"Africa Western and Central", "Albania", "Algeria"],
'year': [1970, 1970, 1970, 1970, 1970],
'GDP_USD':[1.748887e+09, 4.486261e+10, 2.350461e+10, pd.NA, 4.863487e+09],})
# CCA3 Country/Territory year GDP_USD
# 0 AFG Afghanistan 1970 1748887000.0
# 1 AFE Africa Eastern and Southern 1970 44862610000.0
# 2 AFW Africa Western and Central 1970 23504610000.0
# 3 ALB Albania 1970 <NA>
# 4 DZA Algeria 1970 4863487000.0
pp_df = pd.DataFrame({'CCA3': ["AFG" , "ALB", "DZA", "ASM", "AND"],
'Country/Territory': ["Afghanistan" , "Albania",
"Algeria", "American Samoa", "Andorra"],
'Continent': ["Asia", "Europe", "Africa", "Oceania", "Europe"],
'2020 Population':[38972230, 2866849, 43451666, 46189, 77700],
'1970 Population':[10752971, 2324731, 13795915, 27075, 19860],
})
# CCA3 Country/Territory Continent 2020 Population 1970 Population
# 0 AFG Afghanistan Asia 38972230 10752971
# 1 ALB Albania Europe 2866849 2324731
# 2 DZA Algeria Africa 43451666 13795915
# 3 ASM American Samoa Oceania 46189 27075
# 4 AND Andorra Europe 77700 19860
Complete script for checking
import pandas as pd
gp_df = pd.DataFrame({'CCA3': ["AFG" , "AFE", "AFW", "ALB", "DZA"],
'Country/Territory': ["Afghanistan" , "Africa Eastern and Southern",
"Africa Western and Central", "Albania", "Algeria"],
'year': [1970, 1970, 1970, 1970, 1970],
'GDP_USD':[1.748887e+09, 4.486261e+10, 2.350461e+10, pd.NA, 4.863487e+09],})
pp_df = pd.DataFrame({'CCA3': ["AFG" , "ALB", "DZA", "ASM", "AND"],
'Country/Territory': ["Afghanistan" , "Albania",
"Algeria", "American Samoa", "Andorra"],
'Continent': ["Asia", "Europe", "Africa", "Oceania", "Europe"],
'2020 Population':[38972230, 2866849, 43451666, 46189, 77700],
'1970 Population':[10752971, 2324731, 13795915, 27075, 19860],
})
uKeys_1 = pp_df['Country/Territory'].unique()
r = gp_df[gp_df['Country/Territory'].isin(uKeys_1)]
print(r)
Result
# CCA3 Country/Territory year GDP_USD
# 0 AFG Afghanistan 1970 1748887000.0
# 3 ALB Albania 1970 <NA>
# 4 DZA Algeria 1970 4863487000.0
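One caveat: country names don't always match across sources (the excerpts in the question show 'Yemen, Rep.' in gp_df but 'Yemen' in pp_df), so filtering on the CCA3 code may be more robust, assuming both frames carry it:
# match on the three-letter code instead of the display name
r = gp_df[gp_df['CCA3'].isin(pp_df['CCA3'])]
print(r)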
I have two data frames
df1
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 100
1 11 2023 Lyon Chicago,Paris 200
2 11 2023 Berlin Paris 300
3 12 2022 Newyork Chicago 150
4 12 2022 Lyon Chicago,Paris 250
5 12 2022 Berlin Paris 400
df2
ID Year Primary_Location Sales
0 11 2023 Chicago 150
1 11 2023 Paris 200
2 12 2022 Chicago 300
3 12 2022 Paris 350
I would like, for each group having the same ID & Year:
to add the Sales from df2 to the Sales in df1 wherever a Primary_Location in df2 appears (is contained) in Secondary_Location in df1.
For example: for ID=11 & Year=2023, the df2 Sales for Chicago and Paris would be added to the Sales for Lyon.
The new Sales for that Lyon row would be 200+150+200=550.
The expected output would be :
df_primary_output
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Lyon Chicago,Paris 550
2 11 2023 Berlin Paris 500
3 12 2022 Newyork Chicago 400
4 12 2022 Lyon Chicago,Paris 900
5 12 2022 Berlin Paris 750
Here are the dataframes to start with :
import pandas as pd
df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
'Year': [2023, 2023, 2023, 2022, 2022, 2022],
'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
'Sales': [100, 200, 300, 150, 250, 400]
})
df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
'Year': [2023, 2023, 2022, 2022],
'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
'Sales': [150, 200, 300, 350]
})
EDIT: pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Would be great if the solution could work for these inputs as well:
df1
Day ID Year Primary_Location Secondary_Location Sales
0 1 11 2023 NewYork Chicago 100
1 1 11 2023 Berlin Chicago 300
2 1 11 2022 Newyork Chicago 150
3 1 11 2022 Berlin Chicago 400
df2
Day ID Year Primary_Location Sales
0 1 11 2023 Chicago 150
1 1 11 2022 Chicago 300
The expected output would be :
df_primary_output
Day ID Year Primary_Location Secondary_Location Sales
0 1 11 2023 NewYork Chicago 250
1 1 11 2023 Berlin Chicago 450
2 1 11 2022 Newyork Chicago 450
3 1 11 2022 Berlin Chicago 700
This should work:
s = 'Secondary_Location'
(df1.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)['Sales_2'].sum()
.add(df1['Sales']))
or
df3 = (df1.assign(Secondary_Location = df1['Secondary_Location'].str.split(',')) #split Secondary_Location column into list and explode it so each row has one value
.explode('Secondary_Location'))
(df3[['ID','Year','Secondary_Location']].apply(tuple,axis=1) #create a series where ID, Year and Secondary_Location are a combined into a tuple so we can map our series created below to bring in the values needed.
.map(df2.set_index(['ID','Year','Primary_Location'])['Sales']) #create a series with lookup values in index, and make a series by selecting Sales column
.groupby(level=0).sum() #when exploding the column above, the index was repeated, so groupby(level=0).sum() will combine back to original form.
.add(df1['Sales'])) #add in original sales column
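Note that both pipelines return a Series of updated sales aligned with df1's index rather than a full dataframe; a small usage sketch reusing the first pipeline (df_primary_output is just a hypothetical name for the result):
s = 'Secondary_Location'
new_sales = (df1.assign(**{s: df1[s].str.split(',')})
                .explode(s)
                .join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),
                      on=['ID','Year',s])
                .groupby(level=0)['Sales_2'].sum()
                .add(df1['Sales']))
df_primary_output = df1.assign(Sales=new_sales)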
Original Answer:
s = 'Secondary_Location'
(df.assign(Secondary_Location = lambda x: x[s].str.split(','))  # df here is the question's df1, before the rename
 .explode(s)
 .join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'), on=['ID','Year',s])
 .groupby(level=0)
 .agg({**dict.fromkeys(df,'first'), **{s: ','.join, 'Sales_2': 'sum'}})  # keep each column's first value, re-join the exploded locations, sum the matched sales
 .assign(Sales=lambda x: x['Sales'] + x['Sales_2'])
 .drop('Sales_2', axis=1))
Output:
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Lyon Chicago,Paris 550
2 11 2023 Berlin Paris 500
3 12 2022 Newyork Chicago 450
4 12 2022 Lyon Chicago,Paris 900
5 12 2022 Berlin Paris 750
Your question is not so easy...
Proposed script
import pandas as pd
df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
'Year': [2023, 2023, 2023, 2022, 2022, 2022],
'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
'Sales': [100, 200, 300, 150, 250, 400]
})
df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
'Year': [2023, 2023, 2022, 2022],
'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
'Sales': [150, 200, 300, 350]
})
tot = []
def func(g, iterdf, len_df1, i = 0):
global tot
kv = {g['Primary_Location'].iloc[i]:g['Sales'].iloc[i] for i in range(len(g))}
while i < len_df1:
row = next(iterdf)[1]
# Select specific df1 rows to modify by ID and Year criteria
if g['ID'][g.index[0]]==row['ID'] and g['Year'][g.index[0]]==row['Year']:
tot.append(row['Sales'] + sum([kv[town] for town in row['Secondary_Location'].split(',') if town in kv]))
i+=1
df2.groupby(['ID', 'Year'], sort=False).apply(lambda g: func(g, df1.iterrows(), len(df1)))
df1['Sales'] = tot
print(df1)
Result:
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Lyon Chicago,Paris 550
2 11 2023 Berlin Paris 500
3 12 2022 Newyork Chicago 450
4 12 2022 Lyon Chicago,Paris 900
5 12 2022 Berlin Paris 750
Are you sure of the result in row 3? My script found 450, not 400 (150 from df1 plus 300 for Chicago in df2).
Explanation:
1 - groupby(...).apply(...) sends the two groups from df2 one by one to func():
ID Year Primary_Location Sales
0 11 2023 Chicago 150
1 11 2023 Paris 200
ID Year Primary_Location Sales
2 12 2022 Chicago 300
3 12 2022 Paris 350
2 - kv builds dictionaries from df2 like this
(each call corresponds to a group, i.e. ID + Year):
call 1 - {'Chicago': 150, 'Paris': 200}
call 2 - {'Chicago': 300, 'Paris': 350}
3 - The while loop combined with next(iterdf) walks through the rows of df1 one by one:
while i < len_df1:
row = next(iterdf)[1]
...
i+=1
4 - The if condition in the while loop filters the df1 rows whose ID and Year correspond to the current group's values, and for each match appends the combined df1 and df2 sales values to the global list tot.
5 - tot is a global list that memorizes the values and is then assigned to df1 to recreate the Sales column:
df1['Sales'] = tot
Result with the new sample dataframes:
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Berlin Chicago 450
2 11 2022 Newyork Chicago 450
3 11 2022 Berlin Chicago 700
I have a dataframe containing:
State Country Date Cases
0 NaN Afghanistan 2020-01-22 0
271 NaN Afghanistan 2020-01-23 0
... ... ... ... ...
85093 NaN Zimbabwe 2020-11-30 9950
85364 NaN Zimbabwe 2020-12-01 10129
I'm trying to create a new column of cumulative cases but grouped by Country AND State.
State Country Date Cases Total Cases
231 California USA 2020-01-22 5 5
342 California USA 2020-01-23 10 15
233 Texas USA 2020-01-22 4 4
322 Texas USA 2020-01-23 12 16
I have been trying to follow Pandas groupby cumulative sum and have tried things such as:
df['Total'] = df.groupby(['State','Country'])['Cases'].cumsum()
Returns a series of -1's
df['Total'] = df.groupby(['State', 'Country']).sum() \
.groupby(level=0).cumsum().reset_index()
Returns the sum.
df['Total'] = df.groupby(['Country'])['Cases'].apply(lambda x: x.cumsum())
Doesn't separate the sums by state.
df_f['Total'] = df_f.groupby(['Region','State'])['Cases'].apply(lambda x: x.cumsum())
This one works, except that when 'State' is NaN, 'Total' is also NaN.
arrays = [['California', 'California', 'Texas', 'Texas'],
['USA', 'USA', 'USA', 'USA'],
['2020-01-22','2020-01-23','2020-01-22','2020-01-23'], [5,10,4,12]]
df = pd.DataFrame(list(zip(*arrays)), columns = ['State', 'Country', 'Date', 'Cases'])
df
State Country Date Cases
0 California USA 2020-01-22 5
1 California USA 2020-01-23 10
2 Texas USA 2020-01-22 4
3 Texas USA 2020-01-23 12
temp = df.set_index(['State', 'Country','Date'], drop=True).sort_index( )
df['Total Cases'] = temp.groupby(['State', 'Country']).cumsum().reset_index()['Cases']
df
State Country Date Cases Total Cases
0 California USA 2020-01-22 5 5
1 California USA 2020-01-23 10 15
2 Texas USA 2020-01-22 4 4
3 Texas USA 2020-01-23 12 16
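Regarding the NaN State rows in the question: groupby drops NaN keys by default, which is why Total came out NaN for those rows. Assuming pandas >= 1.1, dropna=False keeps them as their own group:
# keep rows with NaN State in the grouping so their cumulative sum is still computed
df['Total Cases'] = df.groupby(['State', 'Country'], dropna=False)['Cases'].cumsum()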
I would like to ask for your advice, please.
How can I transform the first dataframe into the second, below?
Continent, Country and Location are the names of the column index levels.
Polution_Level would be added as the column name for the values present in the first dataframe.
Continent Asia Asia Africa Europe
Country Japan China Mozambique Portugal
Location Tokyo Shanghai Maputo Lisbon
Date
01 Jan 20 250 435 45 137
02 Jan 20 252 457 43 144
03 Jan 20 253 463 42 138
Continent Country Location Date Polution_Level
Asia Japan Tokyo 01 Jan 20 250
Asia Japan Tokyo 02 Jan 20 252
Asia Japan Tokyo 03 Jan 20 253
...
Europe Portugal Lisbon 03 Jan 20 138
Thank you.
The following should do what you want.
Modules
import io
import pandas as pd
Create data
df = pd.read_csv(io.StringIO("""
Continent Asia Asia Africa Europe
Country Japan China Mozambique Portugal
Location Tokyo Shanghai Maputo Lisbon
Date
01 Jan 20 250 435 45 137
02 Jan 20 252 457 43 144
03 Jan 20 253 463 42 138
"""), sep="\s\s+", engine="python", header=[0,1,2], index_col=[0])
Verify multiindex
df.columns
MultiIndex([( 'Asia', 'Japan', 'Tokyo'),
( 'Asia', 'China', 'Shanghai'),
('Africa', 'Mozambique', 'Maputo'),
('Europe', 'Portugal', 'Lisbon')],
names=['Continent', 'Country', 'Location'])
Transpose table and stack values
ndf = df.T.stack().reset_index()
ndf = ndf.rename({0: 'Polution_Level'}, axis=1)  # rename returns a copy, so assign it back
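For what it's worth, a minimal alternative sketch that skips the transpose: on a frame with a single-level index, unstack() collapses the column MultiIndex into row index levels, yielding a Series that can be renamed and flattened in one chain:
ndf = df.unstack().rename('Polution_Level').reset_index()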
Let's say there is a dataframe:
country edition sports Athletes Medals
Germany 1990 Aquatics HAJOS, Alfred silver
Germany 1990 Aquatics HIRSCHMANN, Otto silver
Germany 1990 Aquatics DRIVAS, Dimitrios silver
US 2008 Athletics MALOKINIS, Ioannis silver
US 2008 Athletics HAJOS, Alfred silver
US 2009 Athletics CHASAPIS, Spiridon gold
France 2010 Athletics CHOROPHAS, Efstathios gold
France 2010 golf HAJOS, Alfred silver
France 2011 golf ANDREOU, Joannis silver
I want to find out which edition distributed the most silver medals,
so I'm trying to solve it with the groupby function in this way:
df.groupby('Edition')[df['Medal']=='Silver'].count().idxmax()
but it's giving me
KeyError: 'Columns not found: False, True'
Can anyone tell me what the issue is?
So here's your pandas dataframe:
import pandas as pd
data = [
['Germany', 1990, 'Aquatics', 'HAJOS, Alfred', 'silver'],
['Germany', 1990, 'Aquatics', 'IRSCHMANN, Otto', 'silver'],
['Germany', 1990, 'Aquatics', 'DRIVAS, Dimitrios', 'silver'],
['US', 2008, 'Athletics', 'MALOKINIS, Ioannis', 'silver'],
['US', 2008, 'Athletics', 'HAJOS, Alfred', 'silver'],
['US', 2009, 'Athletics', 'CHASAPIS, Spiridon', 'gold'],
['France', 2010, 'Athletics', 'CHOROPHAS, Efstathios', 'gold'],
['France', 2010, 'golf', 'HAJOS, Alfred', 'silver'],
['France', 2011, 'golf', 'ANDREOU, Joannis', 'silver']
]
df = pd.DataFrame(data, columns = ['country', 'edition', 'sports', 'Athletes', 'Medals'])
print(df)
country edition sports Athletes Medals
0 Germany 1990 Aquatics HAJOS, Alfred silver
1 Germany 1990 Aquatics IRSCHMANN, Otto silver
2 Germany 1990 Aquatics DRIVAS, Dimitrios silver
3 US 2008 Athletics MALOKINIS, Ioannis silver
4 US 2008 Athletics HAJOS, Alfred silver
5 US 2009 Athletics CHASAPIS, Spiridon gold
6 France 2010 Athletics CHOROPHAS, Efstathios gold
7 France 2010 golf HAJOS, Alfred silver
8 France 2011 golf ANDREOU, Joannis silver
Now, you can simply filter the silver medals first and then group by edition (note that 'Edition' will throw a KeyError as opposed to 'edition'; also, indexing a groupby with a boolean Series, as in your attempt, tries to select columns named False and True, which is exactly the KeyError you got). Finally, get the count:
df[df.Medals == 'silver'].groupby('edition').count()['Medals'].idxmax()
>>> 1990
You can group by both columns to solve:
df[df['Medals'] == 'silver'].groupby(['edition','Medals'],as_index=True)['Athletes'].count().idxmax()
# Outcome:
(1990, 'silver')
df[df['Medal']=='silver'].groupby('edition').size().idxmax()
I tried this and it worked! I just replaced count() with size().
You should count per edition per medal:
>>> df = pd.DataFrame({'edition':[1990,1990,1990,2008,2008,2009,2010,2010,2011],'Medals':['silver','silver','silver','silver','silver','gold','gold','silver','silver']})
>>> df['count'] = ''   # placeholder column so transform has something to count
>>> df['count'] = df.groupby(['edition','Medals']).transform('count')
Then do the filtering on max():
>>> df = df[df['Medals'].isin(['silver'])]
>>> df
edition Medals count
0 1990 silver 3
1 1990 silver 3
2 1990 silver 3
3 2008 silver 2
4 2008 silver 2
7 2010 silver 1
8 2011 silver 1
>>> df = df[df['count'].isin([df['count'].max()])]
>>> df
edition Medals count
0 1990 silver 3
1 1990 silver 3
2 1990 silver 3
or
>>> df[df['count'].isin([df['count'].max()])]['Medals'].unique()[0]
'silver'