Replace NA in DataFrame for multiple columns with mean per country


I want to replace NA values in each column with the mean of that column over the other years.
Note: To replace NA values in the Canada rows, I want to use only the mean of the Canada rows, not the mean of the whole dataset.
Here's a sample dataframe filled with random numbers, with some NA values placed the way I find them in my dataframe:
Country  Inhabitants  Year  Area  Cats  Dogs
Canada   38 000 000   2021  4     32    21
Canada   37 000 000   2020  4     NA    21
Canada   36 000 000   2019  3     32    21
Canada   NA           2018  2     32    21
Canada   34 000 000   2017  NA    32    21
Canada   35 000 000   2016  3     32    NA
Brazil   212 000 000  2021  5     32    21
Brazil   211 000 000  2020  4     NA    21
Brazil   210 000 000  2019  NA    32    21
Brazil   209 000 000  2018  4     32    21
Brazil   NA           2017  2     32    21
Brazil   207 000 000  2016  4     32    NA
What's the easiest way with pandas to replace those NA values with the mean of the other years? And is it possible to write code that goes through every NA and replaces them in all columns (Inhabitants, Area, Cats, Dogs) at once?

Note: The example is based on the additional data source you gave in the comments.
To replace the NA values in multiple columns with the mean, you can combine the following three methods:
fillna() (it fills along axis 0, which is its default behavior)
groupby()
transform()
Create a data frame from your example:
import pandas as pd
df = pd.read_excel('https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls')
Country name  year  Life Ladder  Log GDP per capita  Social support  Healthy life expectancy at birth  Freedom to make life choices  Generosity  Perceptions of corruption  Positive affect  Negative affect
Canada        2005  7.41805      10.6518             0.961552        71.3                              0.957306                      0.25623     0.502681                   0.838544         0.233278
Canada        2007  7.48175      10.7392             nan             71.66                             0.930341                      0.249479    0.405608                   0.871604         0.25681
Canada        2008  7.4856       10.7384             0.938707        71.84                             0.926315                      0.261585    0.369588                   0.89022          0.202175
Canada        2009  7.48782      10.6972             0.942845        72.02                             0.915058                      0.246217    0.412622                   0.867433         0.247633
Canada        2010  7.65035      10.7165             0.953765        72.2                              0.933949                      0.230451    0.41266                    0.878868         0.233113
Call fillna() with the per-country means computed by groupby() and transform(); this fills all columns at once:
df = df.fillna(df.groupby('Country name').transform('mean'))
Check your result for Canada:
df[df['Country name'] == 'Canada']
Country name  year  Life Ladder  Log GDP per capita  Social support  Healthy life expectancy at birth  Freedom to make life choices  Generosity  Perceptions of corruption  Positive affect  Negative affect
Canada        2005  7.41805      10.6518             0.961552        71.3                              0.957306                      0.25623     0.502681                   0.838544         0.233278
Canada        2007  7.48175      10.7392             0.93547         71.66                             0.930341                      0.249479    0.405608                   0.871604         0.25681
Canada        2008  7.4856       10.7384             0.938707        71.84                             0.926315                      0.261585    0.369588                   0.89022          0.202175
Canada        2009  7.48782      10.6972             0.942845        72.02                             0.915058                      0.246217    0.412622                   0.867433         0.247633
Canada        2010  7.65035      10.7165             0.953765        72.2                              0.933949                      0.230451    0.41266                    0.878868         0.233113
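
For readers without access to the Excel file, the same groupby/transform/fillna combination can be tried on a small frame shaped like the question's table. This is only a sketch: the numbers, the column subset and the names value_cols, group_means and filled are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the shape of the question's table
df = pd.DataFrame({
    "Country": ["Canada", "Canada", "Canada", "Brazil", "Brazil", "Brazil"],
    "Year": [2021, 2020, 2019, 2021, 2020, 2019],
    "Area": [4.0, np.nan, 2.0, 5.0, 3.0, np.nan],
    "Cats": [30.0, 32.0, np.nan, np.nan, 10.0, 20.0],
})

value_cols = ["Area", "Cats"]

# transform('mean') returns a frame aligned with df in which every row
# holds the mean of its own country's group (NaN values are skipped)
group_means = df.groupby("Country")[value_cols].transform("mean")

# fillna with an aligned DataFrame only changes the cells that are NaN
filled = df.copy()
filled[value_cols] = df[value_cols].fillna(group_means)
print(filled)
```

Restricting the operation to value_cols keeps the Year column out of the mean calculation, which may be preferable when the key columns should never be touched.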

This also works if filling with the mean of the whole column (across all countries, not per country) is acceptable:
In [2]:
df = pd.read_excel('DataPanelWHR2021C2.xls')
In [3]:
# Check for number of null values in df
df.isnull().sum()
Out [3]:
Country name 0
year 0
Life Ladder 0
Log GDP per capita 36
Social support 13
Healthy life expectancy at birth 55
Freedom to make life choices 32
Generosity 89
Perceptions of corruption 110
Positive affect 22
Negative affect 16
dtype: int64
SOLUTION
In [4]:
# Fill NULL values with the mean of each numeric column
df.fillna(df.mean(numeric_only=True), inplace=True)
In [5]:
# 2nd check for number of null values
df.isnull().sum()
Out [5]: No more NULL values
Country name 0
year 0
Life Ladder 0
Log GDP per capita 0
Social support 0
Healthy life expectancy at birth 0
Freedom to make life choices 0
Generosity 0
Perceptions of corruption 0
Positive affect 0
Negative affect 0
dtype: int64
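
To see how a column-wide mean differs from the per-country mean the question asks for, here is a minimal sketch on made-up numbers. The values and the names global_filled, per_country and group_filled are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the two countries have very different scales
df = pd.DataFrame({
    "Country": ["Canada", "Canada", "Brazil", "Brazil"],
    "Inhabitants": [38.0, np.nan, 212.0, 210.0],  # in millions
})

# Column-wide mean: one fill value for the whole column
global_filled = df.fillna(df.mean(numeric_only=True))

# Per-country mean: each NaN is filled from its own country's rows
per_country = df.groupby("Country")["Inhabitants"].transform("mean")
group_filled = df.assign(Inhabitants=df["Inhabitants"].fillna(per_country))

print(global_filled)
print(group_filled)
```

In this toy frame the missing Canada value becomes roughly 153.3 with the column-wide mean, but 38.0 with the per-country mean.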

Related

Pandas where function

I'm using the pandas where function to try to find the percentage for each state:
filter1 = df['state']=='California'
filter2 = df['state']=='Texas'
filter3 = df['state']=='Florida'
df['percentage']= df['total'].where(filter1)/df['total'].where(filter1).sum()
The output is
Year state total percentage
2014 California 914198.0 0.134925
2014 Florida 766441.0 NaN
2014 Texas 1045274.0 NaN
2015 California 874642.0 0.129087
2015 Florida 878760.0 NaN
How do I apply the other two filters as well?

Don't use where but groupby.transform:
df['percentage'] = df['total'].div(df.groupby('state')['total'].transform('sum'))
Output:
Year state total percentage
0 2014 California 914198.0 0.511056
1 2014 Florida 766441.0 0.465865
2 2014 Texas 1045274.0 1.000000
3 2015 California 874642.0 0.488944
4 2015 Florida 878760.0 0.534135

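You can also select the rows for all three states at once with df.loc[filter1 | filter2 | filter3]. Combine the filters with | (or), not & (and): a row belongs to only one state, so & would match nothing.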
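
If only some states should get a percentage, one possible way to combine the two ideas is to compute the per-state share for every row and then blank out the states that are not wanted. The sketch below uses the five rows from the question plus one hypothetical Ohio row; the names wanted, mask and share are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2015, 2015, 2015],
    "state": ["California", "Florida", "Texas", "California", "Florida", "Ohio"],
    # the Ohio row is made up to show what happens to unselected states
    "total": [914198.0, 766441.0, 1045274.0, 874642.0, 878760.0, 500000.0],
})

wanted = ["California", "Texas", "Florida"]

# isin() builds a single boolean mask equivalent to filter1 | filter2 | filter3
mask = df["state"].isin(wanted)

# Each row's share of its own state's total
share = df["total"] / df.groupby("state")["total"].transform("sum")

# Keep the share only for the wanted states; other rows become NaN
df["percentage"] = share.where(mask)
print(df)
```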

Extract values from dataset to perform functions- multiple countries within dataset

My dataset looks as follows:
Country  Year  Value
Ireland  2010  9
Ireland  2011  11
Ireland  2012  14
Ireland  2013  17
Ireland  2014  20
France   2011  15
France   2012  19
France   2013  21
France   2014  28
Germany  2008  17
Germany  2009  20
Germany  2010  19
Germany  2011  24
Germany  2012  27
Germany  2013  32
My goal is to create a new dataset which tells me the % increase from the first year of available data for a given country, compared to the most recent, which would look roughly as follows:
Country  % increase
Ireland  122
France   87
Germany  88
In essence, I need my code for each country in my dataset, to locate the smallest and largest value for year, then take the corresponding values within the value column and calculate the % increase.
I can do this manually, however I have a lot of countries in my dataset and am looking for a more elegant way to do it. I am trying to troubleshoot my code for this however I am not having much luck as of yet.
My code looks as follows at present:
df_1["Min_value"] = df.loc[df["Year"].min(),"Value"].iloc[0]
df_1["Max_value"] = df.loc[df["Year"].max(),"Value"].iloc[0]
df_1["% increase"] = ((df_1["Max_value"]-df_1["Min_value"])/df_1["Min_value"])*100
This returns an error:
AttributeError: 'numpy.float64' object has no attribute 'iloc'
In addition to this it also has the issue that I cannot figure out a way to have the code to run individually for each country within my dataset, so this is another challenge which I am not entirely sure how to address.
Could I potentially go down the route of defining a particular function which could then be applied to each country?

You can group by Country and aggregate min/max for both Year and Value, then calculate percentage change between min and max of the Value.
pct_df = df.groupby(['Country']).agg(['min', 'max'])['Value']\
.apply(lambda x: x.pct_change().round(2) * 100, axis=1)\
.drop('min', axis=1).rename(columns={'max':'% increase'}).reset_index()
print(pct_df)
The output:
Country % increase
0 France 87.0
1 Germany 88.0
2 Ireland 122.0
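
The snippet above compares the smallest and largest Value per country, which matches the first and last year only when values grow over time. If the comparison should follow the years strictly, as the question describes, a possible variant is to sort by Year and take the first and last Value in each group. This is a sketch on the question's sample data; the names by_country, first and last are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Ireland"] * 5 + ["France"] * 4 + ["Germany"] * 6,
    "Year": [2010, 2011, 2012, 2013, 2014,
             2011, 2012, 2013, 2014,
             2008, 2009, 2010, 2011, 2012, 2013],
    "Value": [9, 11, 14, 17, 20,
              15, 19, 21, 28,
              17, 20, 19, 24, 27, 32],
})

# Sort by year so that first()/last() mean earliest and latest year per country
by_country = df.sort_values("Year").groupby("Country")["Value"]
first, last = by_country.first(), by_country.last()

pct_df = (
    ((last - first) / first * 100)
    .round()
    .astype(int)
    .rename("% increase")
    .reset_index()
)
print(pct_df)
```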

Sorting grouped DataFrame column without changing index sorting

I have a df as below:
I want only the top 5 countries from each year but keeping the year ascending.
First I grouped the df by year and country name and then ran the following code:
df.sort_values(['year','hydro_total'], ascending=False).groupby(['year']).head(5)
The result didn't keep the index ascending, instead, it sorted the year index too. How do I get the top 5 countries and keep the year's group ascending?
The CSV file is uploaded HERE .

You are sorting by both year and hydro_total in descending order. You need to sort year in ascending order instead:
(df.sort_values(['year', 'hydro_total'],
                ascending=[True, False])
   .groupby('year').head(5)
)
Output:
country year hydro_total hydro_per_person
440 Japan 1971 7240000.0 0.06890
160 China 1971 2580000.0 0.00308
240 India 1971 2410000.0 0.00425
760 North Korea 1971 788000.0 0.05380
800 Pakistan 1971 316000.0 0.00518
... ... ... ... ...
199 China 2010 62100000.0 0.04630
279 India 2010 9840000.0 0.00803
479 Japan 2010 7070000.0 0.05590
1119 Turkey 2010 4450000.0 0.06120
839 Pakistan 2010 2740000.0 0.01580
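
Since the CSV is not reproduced here, the same pattern can be checked on a tiny made-up frame. The countries, values and the top_n variable below are illustrative assumptions, with top_n set to 2 to keep the example short.

```python
import pandas as pd

# Hypothetical data: three countries over two years
df = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [1971, 1971, 1971, 1972, 1972, 1972],
    "hydro_total": [10.0, 30.0, 20.0, 5.0, 1.0, 9.0],
})

top_n = 2

# Years ascending, values descending within each year;
# head(top_n) then keeps the first top_n rows of every year group
top = (
    df.sort_values(["year", "hydro_total"], ascending=[True, False])
      .groupby("year")
      .head(top_n)
)
print(top)
```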

Calculating new rows in a Pandas Dataframe on two different columns

So I'm a beginner at Python and I have a dataframe with Country, avgTemp and year.
What I want to do is calculate new rows on each country where the year adds 20 and avgTemp is multiplied by a variable called tempChange. I don't want to remove the previous values though, I just want to append the new values.
This is how the dataframe looks:
Preferably I would also want to create a loop that runs the code a certain number of times
Super grateful for any help!
If you need to copy the values from the dataframe as an example you can have it here:
Country avgTemp year
0 Afghanistan 14.481583 2012
1 Africa 24.725917 2012
2 Albania 13.768250 2012
3 Algeria 23.954833 2012
4 American Samoa 27.201417 2012
243 rows × 3 columns

If you want to repeat the rows, I'd create a new dataframe, perform any operation on it (add 20 years, multiply the temperature by a constant or an array, etc.) and then use concat() to append it to the original dataframe:
import pandas as pd
tempChange=1.15
data = {'Country':['Afghanistan','Africa','Albania','Algeria','American Samoa'],'avgTemp':[14,24,13,23,27],'Year':[2012,2012,2012,2012,2012]}
df = pd.DataFrame(data)
df_2 = df.copy()
df_2['avgTemp'] = df['avgTemp']*tempChange
df_2['Year'] = df['Year']+20
df = pd.concat([df,df_2]) #ignore_index=True if you wish to not repeat the index value
print(df)
Output:
Country avgTemp Year
0 Afghanistan 14.00 2012
1 Africa 24.00 2012
2 Albania 13.00 2012
3 Algeria 23.00 2012
4 American Samoa 27.00 2012
0 Afghanistan 16.10 2032
1 Africa 27.60 2032
2 Albania 14.95 2032
3 Algeria 26.45 2032
4 American Samoa 31.05 2032
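
The question also asks for a loop that repeats this step a number of times. One way to sketch that, assuming each step builds on the previous one (20 more years, temperature multiplied again), is to collect the frames in a list and concatenate once at the end. The names temp_change, n_steps, step_years and frames are hypothetical.

```python
import pandas as pd

temp_change = 1.15  # assumed multiplier applied at every step
n_steps = 3         # assumed number of repetitions
step_years = 20

df = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "avgTemp": [14.0, 13.0],
    "Year": [2012, 2012],
})

frames = [df]
for _ in range(n_steps):
    # Start from the most recent block of rows and project it forward
    nxt = frames[-1].copy()
    nxt["avgTemp"] = nxt["avgTemp"] * temp_change
    nxt["Year"] = nxt["Year"] + step_years
    frames.append(nxt)

# A single concat at the end avoids growing the frame inside the loop
result = pd.concat(frames, ignore_index=True)
print(result)
```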

where df is your data frame name:
df['tempChange'] = df['year']+ 20 * df['avgTemp']
This will add a new column to your df with the logic above. I'm not sure I understood your logic correctly, so the math may need some work.

I believe that what you're looking for is
dfName['newYear'] = dfName.apply(lambda x: x['year'] + 20,axis=1)
dfName['tempDiff'] = dfName.apply(lambda x: x['avgTemp']*tempChange,axis=1)
This is how you apply to each row.

How to output the top 5 of a specific column along with associated columns using python?

I've tried to use df2.nlargest(5, ['1960']), which gives me:
Country Name Country Code ... 2017 2018
0 IDA & IBRD total IBT ... 6335039629.0000 6412522234.0000
1 Low & middle income LMY ... 6306560891.0000 6383958209.0000
2 Middle income MIC ... 5619111361.0000 5678540888.0000
3 IBRD only IBD ... 4731120193.0000 4772284113.0000
6 Upper middle income UMC ... 2637690770.0000 2655635719.0000
This is somewhat right, but it's outputting all the columns. I just want it to include the column name "Country Name" and "1960" only, but sort by the column "1960."
So the output should look like this...
Country Name 1960
China 5000000000
India 499999999
USA 300000
France 100000
Germany 90000
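
No answer is recorded for this question here. One possible approach, sketched under the assumption that df2 has the columns 'Country Name' and '1960', is to select the two columns after calling nlargest. The frame below reuses the placeholder figures from the desired output plus one made-up extra row.

```python
import pandas as pd

# Hypothetical stand-in for df2 with a few extra columns
df2 = pd.DataFrame({
    "Country Name": ["France", "China", "Chile", "USA", "Germany", "India"],
    "Country Code": ["FRA", "CHN", "CHL", "USA", "DEU", "IND"],
    "1960": [100000, 5000000000, 50000, 300000, 90000, 499999999],
    "2018": [1, 2, 3, 4, 5, 6],  # placeholder values, not real data
})

# nlargest sorts by '1960' in descending order and keeps five rows;
# the double brackets then keep only the two columns of interest
top5 = df2.nlargest(5, "1960")[["Country Name", "1960"]]
print(top5.to_string(index=False))
```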
