Pandas where function - python

I'm using Pandas where function trying to find the percentage in each state
filter1 = df['state']=='California'
filter2 = df['state']=='Texas'
filter3 = df['state']=='Florida'
df['percentage']= df['total'].where(filter1)/df['total'].where(filter1).sum()
The output is
Year state total percentage
2014 California 914198.0 0.134925
2014 Florida 766441.0 NaN
2014 Texas 1045274.0 NaN
2015 California 874642.0 0.129087
2015 Florida 878760.0 NaN
how do I apply the rest of 2 filters into there too?

Don't use where but groupby.transform:
df['percentage'] = df['total'].div(df.groupby('state')['total'].transform('sum'))
Output:
Year state total percentage
0 2014 California 914198.0 0.511056
1 2014 Florida 766441.0 0.465865
2 2014 Texas 1045274.0 1.000000
3 2015 California 874642.0 0.488944
4 2015 Florida 878760.0 0.534135

You can try out df.loc[(filter1) & (filter2) & (filter3)] in pandas to apply multiple filter together !

Related

Calculate summary statistic by category and filter - efficient code?

I have the two following dataframes.
df1:
code name region
0 AFG Afghanistan Middle East
1 NLD Netherlands Western Europe
2 AUT Austria Western Europe
3 IRQ Iraq Middle East
4 USA United States North America
5 CAD Canada North America
df2:
code year gdp per capita
0 AFG 2010 547.35
1 NLD 2010 44851.27
2 AUT 2010 3577.10
3 IRQ 2010 4052.06
4 USA 2010 52760.00
5 CAD 2010 41155.32
6 AFG 2015 578.47
7 NLD 2015 45175.23
8 AUT 2015 3952.80
9 IRQ 2015 4688.32
10 USA 2015 56863.37
11 CAD 2015 43635.10
I want to return the code, year, gdp per capita, and average (gdp per capita per region per year) for 2015 for countries with gdp above average for their region (should be NLD, IRQ, USA).
The result should look something like this:
code year gdp per capita average
3 NLD 2015 45175.23 24564.015
7 IRQ 2015 4688.32 2633.395
9 USA 2015 56863.37 50249.235
I wanted to try this in Python because I recently completed an introductory course to SQL and was amazed at the simplicity of the solution in SQL. While I managed to make it work in Python, it seems overly complicated to me. Is there any way to achieve the same result with less code or without the need for .groupby and helper columns? Please see my solution below.
data = pd.merge(df1, df2, how="inner", on="code")
grouper = data.groupby(["region", "year"])["gdp per capita"].mean().reset_index()
for i in range(len(data)):
average = (grouper.loc[(grouper["year"] == data.loc[i, "year"]) & (grouper["region"] == data.loc[i, "region"]), "gdp per capita"]).to_list()[0]
data.loc[i, "average"] = average
result = data.loc[(data["year"] == 2015) & (data["gdp per capita"] > data["average"]), ["code", "year", "gdp per capita", "average"]]
print(result)
Loops are basically never the right answer when it comes to pandas.
# This is your join and where clause.
df = df1.merge(df2, on='code')[lambda x: x.year.eq(2015)]
# This is your aggregate function.
df['average'] = df.groupby(['region'])['gdp per capita'].transform('mean')
# This is your select and having clause.
out = df[df['gdp per capita'].gt(df['average'])][['code', 'year', 'gdp per capita', 'average']]
print(out)
Output:
code year gdp per capita average
3 NLD 2015 45175.23 24564.015
7 IRQ 2015 4688.32 2633.395
9 USA 2015 56863.37 50249.235

Replace NA in DataFrame for multiple columns with mean per country

I want to replace NA values with the mean of other column with the same year.
Note: To replace NA values for Canada data, I want to use only the mean of Canada, not the mean from the whole dataset of course.
Here's a sample dataframe filled with random numbers. And some NA how i find them in my dataframe:
Country
Inhabitants
Year
Area
Cats
Dogs
Canada
38 000 000
2021
4
32
21
Canada
37 000 000
2020
4
NA
21
Canada
36 000 000
2019
3
32
21
Canada
NA
2018
2
32
21
Canada
34 000 000
2017
NA
32
21
Canada
35 000 000
2016
3
32
NA
Brazil
212 000 000
2021
5
32
21
Brazil
211 000 000
2020
4
NA
21
Brazil
210 000 000
2019
NA
32
21
Brazil
209 000 000
2018
4
32
21
Brazil
NA
2017
2
32
21
Brazil
207 000 000
2016
4
32
NA
What's the easiest way with pandas to replace those NA with the mean values of the other years? And is it possible to write a code for which it is possible to go through every NA and replace them (Inhabitants, Area, Cats, Dogs at once)?
Note Example is based on your additional data source from the comments
Replacing the NA-Values for multiple columns with mean() you can combine the following three methods:
fillna() (Iterating per column axis should be 0, which is default value of fillna())
groupby()
transform()
Create data frame from your example:
df = pd.read_excel('https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls')
Country name
year
Life Ladder
Log GDP per capita
Social support
Healthy life expectancy at birth
Freedom to make life choices
Generosity
Perceptions of corruption
Positive affect
Negative affect
Canada
2005
7.41805
10.6518
0.961552
71.3
0.957306
0.25623
0.502681
0.838544
0.233278
Canada
2007
7.48175
10.7392
nan
71.66
0.930341
0.249479
0.405608
0.871604
0.25681
Canada
2008
7.4856
10.7384
0.938707
71.84
0.926315
0.261585
0.369588
0.89022
0.202175
Canada
2009
7.48782
10.6972
0.942845
72.02
0.915058
0.246217
0.412622
0.867433
0.247633
Canada
2010
7.65035
10.7165
0.953765
72.2
0.933949
0.230451
0.41266
0.878868
0.233113
Call fillna() and iterate over all columns grouped by name of country:
df = df.fillna(df.groupby('Country name').transform('mean'))
Check your result for Canada:
df[df['Country name'] == 'Canada']
Country name
year
Life Ladder
Log GDP per capita
Social support
Healthy life expectancy at birth
Freedom to make life choices
Generosity
Perceptions of corruption
Positive affect
Negative affect
Canada
2005
7.41805
10.6518
0.961552
71.3
0.957306
0.25623
0.502681
0.838544
0.233278
Canada
2007
7.48175
10.7392
0.93547
71.66
0.930341
0.249479
0.405608
0.871604
0.25681
Canada
2008
7.4856
10.7384
0.938707
71.84
0.926315
0.261585
0.369588
0.89022
0.202175
Canada
2009
7.48782
10.6972
0.942845
72.02
0.915058
0.246217
0.412622
0.867433
0.247633
Canada
2010
7.65035
10.7165
0.953765
72.2
0.933949
0.230451
0.41266
0.878868
0.233113
This also works:
In [2]:
df = pd.read_excel('DataPanelWHR2021C2.xls')
In [3]:
# Check for number of null values in df
df.isnull().sum()
Out [3]:
Country name 0
year 0
Life Ladder 0
Log GDP per capita 36
Social support 13
Healthy life expectancy at birth 55
Freedom to make life choices 32
Generosity 89
Perceptions of corruption 110
Positive affect 22
Negative affect 16
dtype: int64
SOLUTION
In [4]:
# Adds mean of column to any NULL values
df.fillna(df.mean(), inplace=True)
In [5]:
# 2nd check for number of null values
df.isnull().sum()
Out [5]: No more NULL values
Country name 0
year 0
Life Ladder 0
Log GDP per capita 0
Social support 0
Healthy life expectancy at birth 0
Freedom to make life choices 0
Generosity 0
Perceptions of corruption 0
Positive affect 0
Negative affect 0
dtype: int64

Creating a set of columns from rows using pandas

I have a dataframe that currently looks like this:
Year Country Subject Descriptor GDP
0 2015 Austria r 344.2
1 2015 Austria n 344.2
2 2015 Austria d 100
3 2015 Austria u 5.742
4 2015 Belgium r 416.7
5 2015 Belgium n 416.7
6 2015 Belgium d 100
7 2015 Belgium u 8.483
I want to transform it to look something along these lines:
Year Country GDP_R GDP_N GDP_D GDP_U
2015 Austria 344.2 344.2 100 5.742
2015 Belgium 416.7 416.7 100 8.483
So far I have attempted to use melt and stack but I feel like I'm just missing it, if you can help me here it'd be much appreciated.
Thank you!
You can first use groupby.agg() and put all values of GDP column in a list. Then, you can convert the object to a new DataFrame, using as columns the prefix 'GDP_' and all the values of the Subject Descriptor column.
Finally, putting the two together using pd.concat() will give your final output.
Please see below an example:
one = df.groupby(['Year','Country'])['GDP'].agg(list).reset_index()
two = pd.DataFrame(one['GDP'].to_list(), columns=['GDP_' + s.upper() for s in set(df['Subject Descriptor'].tolist())])
new = pd.concat([one,two],axis=1).drop('GDP',axis=1)
new prints back:
Year Country GDP_D GDP_N GDP_R GDP_U
0 2015 Austria 344.2 344.2 100.0 5.742
1 2015 Belgium 416.7 416.7 100.0 8.483
First you can use groupby on ['Year', 'Country'] and next you can convert the GDPs for each group to a list and then transpose them to columns. Last few steps are to rename columns, reset index and remove column axis name.
(
df.groupby(['Year', 'Country'])
.apply(lambda x: pd.Series(x.GDP.tolist(), index=x['Subject Descriptor']))
.rename(columns = lambda x: f'GDP_{x.upper()}')
.reset_index()
.rename_axis('', axis=1)
)
You can use a pivot in this case :
(df.pivot(['Year', 'Country'], 'Subject_Descriptor', 'GDP')
.rename(columns = lambda col: f"GDP_{col.upper()}")
.rename_axis(columns=None).reset_index()
)
Year Country GDP_D GDP_N GDP_R GDP_U
0 2015 Austria 100.0 344.2 344.2 5.742
1 2015 Belgium 100.0 416.7 416.7 8.483

I want to filter rows from data frame where the year is 2020 and 2021 using re.search and re.match functions

Data Frame:
Unnamed: 0 date target insult tweet year
0 1 2014-10-09 thomas-frieden fool Can you believe this fool, Dr. Thomas Frieden ... 2014
1 2 2014-10-09 thomas-frieden DOPE Can you believe this fool, Dr. Thomas Frieden ... 2014
2 3 2015-06-16 politicians all talk and no action Big time in U.S. today - MAKE AMERICA GREAT AG... 2015
3 4 2015-06-24 ben-cardin It's politicians like Cardin that have destroy... Politician #SenatorCardin didn't like that I s... 2015
4 5 2015-06-24 neil-young total hypocrite For the nonbeliever, here is a photo of #Neily... 2015
I want the data frame which consists for only year with 2020 and 2021 using search and match methods.
df_filtered = df.loc[df.year.str.contains('2014|2015', regex=True) == True]

How to fill dataframe's empty/nan cell with conditional column mean

I am trying to fill the (pandas) dataframe's null/empty value using the mean of that specific column.
The data looks like this:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019
.
.
I am trying to fill that empty cell with the mean of Revenue column where Industry is == 'Construction'.
To get our numerical mean value I did:
df.groupby(['Industry'], as_index = False).mean()
I am trying to do something like this to fill up that empty cell in-place:
(df[df['Industry'] == "Construction"]['Revenue']).fillna("$21212121.01", inplace = True)
..but it is not working. Can anyone tell me how to achieve it! Thanks a lot.
Expected Output:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013 $21212121.01
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019 $21212121.01
.
.
Although the numbers used as averages are different, we have presented two types of averages: the normal average and the average calculated on the number of cases that include NaN.
df['Revenue'] = df['Revenue'].replace({'\$':'', ',':''}, regex=True)
df['Revenue'] = df['Revenue'].astype(float)
df_mean = df.groupby(['Industry'], as_index = False)['Revenue'].mean()
df_mean
Industry Revenue
0 Construction 4.358071e+06
1 Financial Services 8.858420e+06
2 IT Services 1.175702e+07
df_mean_nan = df.groupby(['Industry'], as_index = False)['Revenue'].agg({'Sum':np.sum, 'Size':np.size})
df_mean_nan['Mean_nan'] = df_mean_nan['Sum'] / df_mean_nan['Size']
df_mean_nan
Industry Sum Size Mean_nan
0 Construction 13074212.0 5.0 2614842.4
1 Financial Services 17716840.0 2.0 8858420.0
2 IT Services 11757018.0 1.0 11757018.0
Average taking into account the number of NaNs
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean_nan.loc[df_mean_nan['Industry'] == 'Construction',['Mean_nan']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5387469.0
1 2 Rednimdox Construction 2013 2614842.4
2 3 Lamtone IT Services 2009 11757018.0
3 4 Stripfind Financial Services 2010 12329371.0
4 5 Openjocon Construction 2013 4273207.0
5 6 Villadox Construction 2012 1097353.0
6 7 Sumzoomit Construction 2010 7703652.0
7 8 Abcddd Construction 2019 2614842.4
Normal average: (NaN is excluded)
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean.loc[df_mean['Industry'] == 'Construction',['Revenue']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5.387469e+06
1 2 Rednimdox Construction 2013 4.358071e+06
2 3 Lamtone IT Services 2009 1.175702e+07
3 4 Stripfind Financial Services 2010 1.232937e+07
4 5 Openjocon Construction 2013 4.273207e+06
5 6 Villadox Construction 2012 1.097353e+06
6 7 Sumzoomit Construction 2010 7.703652e+06
7 8 Abcddd Construction 2019 4.358071e+06

Categories

Resources