Apply groupby on a DataFrame to display cumulative stats in Python

Let's say I have a DataFrame that looks like this:
Bank Name  House    This Wk
Barc       Germany      100
Barc       UK           300
Barc       UK           500
JPM        Japan        200
JPM        NYC          100
BOA        LA           900
BOA        LA            50
BOA        LA            50
DB         Italy         45
I would like to group by Bank Name while outputting the largest House value as well as the total value...
Using the example above, the result would be:
Bank Name  Total  House  This Wk
Barc         900  UK         500
JPM          300  Japan      200
BOA         1000  LA         900
DB            45  Italy       45
Essentially, it groups the This Wk total by Bank Name while also outputting the largest contributor to that total: the House and the amount it contributed in This Wk.
How can I go about doing this?
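For a reproducible setup, the example frame can be built like so (a sketch using the values above):

import pandas as pd

df = pd.DataFrame({
    'Bank Name': ['Barc', 'Barc', 'Barc', 'JPM', 'JPM', 'BOA', 'BOA', 'BOA', 'DB'],
    'House': ['Germany', 'UK', 'UK', 'Japan', 'NYC', 'LA', 'LA', 'LA', 'Italy'],
    'This Wk': [100, 300, 500, 200, 100, 900, 50, 50, 45],
})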

In [121]: df.groupby('Bank Name', group_keys=False) \
     ...:   .apply(lambda x: x.nlargest(1, 'This Wk').assign(Total=x['This Wk'].sum())) \
     ...:   [['Bank Name','Total','House','This Wk']]
     ...:
Out[121]:
Bank Name Total House This Wk
5 BOA 1000 LA 900
2 Barc 900 UK 500
8 DB 45 Italy 45
3 JPM 300 Japan 200

You can combine df.groupby with a list of DataFrameGroupBy.agg functions:
In [732]: out = df.groupby('Bank Name')['This Wk'].agg(['sum', 'idxmax', 'max'])\
                  .rename(columns={'sum': 'Total', 'idxmax': 'House', 'max': 'This Wk'})\
                  .reset_index()
In [734]: out['House'] = df.loc[out['House'], 'House'].values; out  # map the idxmax row labels to House names
Out[734]:
Bank Name Total House This Wk
0 BOA 1000 LA 900
1 Barc 900 UK 500
2 DB 45 Italy 45
3 JPM 300 Japan 200
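On recent pandas versions, named aggregation is a close variant of the same idea; a sketch (not from the original answers) that maps the idxmax row labels back to House afterwards:

out = (df.groupby('Bank Name', as_index=False)
         .agg(Total=('This Wk', 'sum'), idx=('This Wk', 'idxmax')))
out[['House', 'This Wk']] = df.loc[out['idx'], ['House', 'This Wk']].values
out = out.drop(columns='idx')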

Another way using apply would be:
In [17]: (df.groupby('Bank Name', sort=False)
            .apply(lambda x: pd.Series(
                [x['This Wk'].sum(),
                 x.loc[x['This Wk'].idxmax(), 'House'],
                 x['This Wk'].max()],
                index=['Total', 'House', 'This Wk']))
            .reset_index())
Out[17]:
Bank Name Total House This Wk
0 Barc 900 UK 500
1 JPM 300 Japan 200
2 BOA 1000 LA 900
3 DB 45 Italy 45

Reshaping the dataset in Python

I have this dataset:
Account  lookup  FY11USD  FY12USD  FY11local  FY12local
Sales    CA         1000     5000        800       4800
Sales    JP         5000     6500         10         15
I am trying to get the data into this format (the example below has 2 years of data, but the number of years can vary):
Account  lookup  Year   USD  Local
Sales    CA      FY11  1000    800
Sales    CA      FY12  5000   4800
Sales    JP      FY11  5000     10
Sales    JP      FY12  6500     15
I tried using the below script, but it doesn't segregate USD and local for the same year. How should I go about that?
df.melt(id_vars=["Account", "lookup"],
        var_name="Year",
        value_name="Value")
You can piece it together like so:
dfn = pd.concat(
    [df[['Account', 'lookup', 'FY11USD', 'FY12USD']]
       .melt(id_vars=['Account', 'lookup'], var_name='Year', value_name='USD'),
     df[['Account', 'lookup', 'FY11local', 'FY12local']]
       .melt(id_vars=['Account', 'lookup'], var_name='Year', value_name='Local')[['Local']]],
    axis=1)
dfn['Year'] = dfn['Year'].str[:4]
Output
Account lookup Year USD Local
0 Sales CA FY11 1000 800
1 Sales JP FY11 5000 10
2 Sales CA FY12 5000 4800
3 Sales JP FY12 6500 15
One efficient option is to transform to long form with pivot_longer from pyjanitor, using the .value placeholder; .value determines which parts of the column names remain as headers:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(
    index = ['Account', 'lookup'],
    names_to = ('Year', '.value'),
    names_pattern = r"(FY\d+)(.+)")
Account lookup Year USD local
0 Sales CA FY11 1000 800
1 Sales JP FY11 5000 10
2 Sales CA FY12 5000 4800
3 Sales JP FY12 6500 15
Another option is to use stack:
temp = df.set_index(['Account', 'lookup'])
# split e.g. 'FY11USD' into ('', 'FY11', 'USD') and drop the empty first level
temp.columns = temp.columns.str.split(r'(FY\d+)', expand = True).droplevel(0)
temp.columns.names = ['Year', None]
temp.stack('Year').reset_index()
Account lookup Year USD local
0 Sales CA FY11 1000 800
1 Sales CA FY12 5000 4800
2 Sales JP FY11 5000 10
3 Sales JP FY12 6500 15
You can also pull it off with pd.wide_to_long after reshaping the columns:
index = ['Account', 'lookup']
temp = df.set_index(index)
# move the FY prefix to the end ('FY11USD' -> 'USDFY11') so that
# 'USD' and 'local' become the wide_to_long stubnames
temp.columns = (temp
    .columns
    .str.split(r'(FY\d+)')
    .str[::-1]
    .str.join('')
)
(pd.wide_to_long(
    temp.reset_index(),
    stubnames = ['USD', 'local'],
    i = index,
    j = 'Year',
    suffix = '.+')
   .reset_index()
)
Account lookup Year USD local
0 Sales CA FY11 1000 800
1 Sales CA FY12 5000 4800
2 Sales JP FY11 5000 10
3 Sales JP FY12 6500 15

Different aggfunc logic for different groupings in a pandas pivot table

I would like to apply different "aggfunc" logic to a pandas pivot table. Let's suppose I have the below df1.
df1 = pd.DataFrame({'Country': ['Italy', 'Italy', 'Italy', 'Germany', 'Germany', 'Germany', 'France', 'France'],
                    'City': ['Rome', 'Rome', 'Florence', 'Berlin', 'Munich', 'Koln', 'Paris', 'Paris'],
                    'Numbers': [100, 200, 300, 400, 500, 600, 700, 800]})
I would like to calculate the sum of "Numbers" per City and the mean of "Numbers" per Country. I should get the output below. I must use pd.pivot_table, but if you have a better solution you can also suggest it. Would you be able to help me out?
Country  City      SUM  MEAN
France   Paris    1500   750
Germany  Berlin    400   500
Germany  Köln      600   500
Germany  Munich    500   500
Italy    Florence  300   200
Italy    Rome      300   200
I have tried using the following but it obviously does not work.
pd.pivot_table(df1, values = 'Numbers', index=['Country', 'City'], aggfunc=[np.sum, np.mean])
Use GroupBy.transform:
new_df = \
    df1.assign(
        SUM = df1.groupby('City', sort=False)['Numbers'].transform('sum'),
        MEAN = df1.groupby('Country', sort=False)['Numbers'].transform('mean')
    ).drop_duplicates(['Country', 'City']).drop('Numbers', axis=1)
   Country      City   SUM  MEAN
0    Italy      Rome   300   200
2    Italy  Florence   300   200
3  Germany    Berlin   400   500
4  Germany    Munich   500   500
5  Germany      Koln   600   500
6   France     Paris  1500   750
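Since SUM and MEAN use different groupings (City vs. Country), a single pivot_table call cannot produce both directly. If pivot_table is required, one sketch is to compute each grouping separately and merge the results:

import pandas as pd

city_sum = (pd.pivot_table(df1, values='Numbers', index=['Country', 'City'], aggfunc='sum')
              .rename(columns={'Numbers': 'SUM'})
              .reset_index())
country_mean = (pd.pivot_table(df1, values='Numbers', index='Country', aggfunc='mean')
                  .rename(columns={'Numbers': 'MEAN'})
                  .reset_index())
result = city_sum.merge(country_mean, on='Country')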

Replace NA in DataFrame for multiple columns with mean per country

I want to replace NA values with the mean of other column with the same year.
Note: To replace NA values for Canada data, I want to use only the mean of Canada, not the mean from the whole dataset of course.
Here's a sample dataframe filled with random numbers, including some NA values as they appear in my dataframe:
Country  Inhabitants  Year  Area  Cats  Dogs
Canada    38 000 000  2021     4    32    21
Canada    37 000 000  2020     4    NA    21
Canada    36 000 000  2019     3    32    21
Canada            NA  2018     2    32    21
Canada    34 000 000  2017    NA    32    21
Canada    35 000 000  2016     3    32    NA
Brazil   212 000 000  2021     5    32    21
Brazil   211 000 000  2020     4    NA    21
Brazil   210 000 000  2019    NA    32    21
Brazil   209 000 000  2018     4    32    21
Brazil            NA  2017     2    32    21
Brazil   207 000 000  2016     4    32    NA
What's the easiest way with pandas to replace those NA values with the mean of the other years? And is it possible to write code that goes through every NA and replaces them (Inhabitants, Area, Cats, Dogs) at once?
Note: the example below is based on your additional data source from the comments.
To replace the NA values in multiple columns with per-country means, you can combine the following three methods:
fillna() (iterating per column; axis should be 0, which is the default for fillna())
groupby()
transform()
Create data frame from your example:
df = pd.read_excel('https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls')
Country name  year  Life Ladder  Log GDP per capita  Social support  Healthy life expectancy at birth  Freedom to make life choices  Generosity  Perceptions of corruption  Positive affect  Negative affect
Canada        2005  7.41805      10.6518             0.961552        71.3                              0.957306                      0.25623     0.502681                   0.838544         0.233278
Canada        2007  7.48175      10.7392             nan             71.66                             0.930341                      0.249479    0.405608                   0.871604         0.25681
Canada        2008  7.4856       10.7384             0.938707        71.84                             0.926315                      0.261585    0.369588                   0.89022          0.202175
Canada        2009  7.48782      10.6972             0.942845        72.02                             0.915058                      0.246217    0.412622                   0.867433         0.247633
Canada        2010  7.65035      10.7165             0.953765        72.2                              0.933949                      0.230451    0.41266                    0.878868         0.233113
Call fillna() with the per-country column means computed via groupby() and transform():
df = df.fillna(df.groupby('Country name').transform('mean'))
Check your result for Canada:
df[df['Country name'] == 'Canada']
Country name  year  Life Ladder  Log GDP per capita  Social support  Healthy life expectancy at birth  Freedom to make life choices  Generosity  Perceptions of corruption  Positive affect  Negative affect
Canada        2005  7.41805      10.6518             0.961552        71.3                              0.957306                      0.25623     0.502681                   0.838544         0.233278
Canada        2007  7.48175      10.7392             0.93547         71.66                             0.930341                      0.249479    0.405608                   0.871604         0.25681
Canada        2008  7.4856       10.7384             0.938707        71.84                             0.926315                      0.261585    0.369588                   0.89022          0.202175
Canada        2009  7.48782      10.6972             0.942845        72.02                             0.915058                      0.246217    0.412622                   0.867433         0.247633
Canada        2010  7.65035      10.7165             0.953765        72.2                              0.933949                      0.230451    0.41266                    0.878868         0.233113
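The same per-country pattern applies to the toy Country/Cats/Dogs frame from the question; a minimal self-contained sketch (values abbreviated from the question's table):

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Country': ['Canada', 'Canada', 'Canada', 'Brazil', 'Brazil', 'Brazil'],
    'Year': [2021, 2020, 2019, 2021, 2020, 2019],
    'Cats': [32, np.nan, 32, 32, np.nan, 32],
    'Dogs': [21, 21, np.nan, 21, 21, np.nan],
})

# Each NA becomes the mean of the same column within the same country
toy = toy.fillna(toy.groupby('Country').transform('mean'))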
This also works, if filling with the overall column mean (rather than the per-country mean) is acceptable:
In [2]:
df = pd.read_excel('DataPanelWHR2021C2.xls')
In [3]:
# Check for number of null values in df
df.isnull().sum()
Out [3]:
Country name 0
year 0
Life Ladder 0
Log GDP per capita 36
Social support 13
Healthy life expectancy at birth 55
Freedom to make life choices 32
Generosity 89
Perceptions of corruption 110
Positive affect 22
Negative affect 16
dtype: int64
SOLUTION
In [4]:
# Adds the mean of each numeric column to any NULL values; numeric_only
# skips the non-numeric 'Country name' column (required in pandas >= 2.0)
df.fillna(df.mean(numeric_only=True), inplace=True)
In [5]:
# 2nd check for number of null values
df.isnull().sum()
Out [5]: No more NULL values
Country name 0
year 0
Life Ladder 0
Log GDP per capita 0
Social support 0
Healthy life expectancy at birth 0
Freedom to make life choices 0
Generosity 0
Perceptions of corruption 0
Positive affect 0
Negative affect 0
dtype: int64
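A combination of both answers is also possible: fill with per-country means first, then fall back to the overall column mean for anything still missing. A sketch (the select_dtypes column selection is an assumption, not from the original answers):

num_cols = df.select_dtypes('number').columns
df[num_cols] = df[num_cols].fillna(df.groupby('Country name')[num_cols].transform('mean'))
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())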

How to check and group by all the objects in a DataFrame column that start with a given prefix

I have a dataframe where I need to check, group by, and sum all the data.
I have used a regex function to find and group all rows of data that start with the respective countries.
Suppose I have a dataset
Countries     31-12-17  1-1-18  2-1-18  3-1-18   Sum
India-Basic       1200    1100     800     900  4000
Sweden-Basic      1500    1300     700    1500  5000
Norway-Basic       800     400     900     900  3000
India-Exp          600    1400     300     200  2500
Sweden-Exp        1800     400     600     700  3500
Norway-Exp        1300    1600    1100    1500  4500
Expected Output:
Countries   Sum
India      6500
Sweden     8500
Norway     7500
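For a reproducible test, a cut-down frame with just the columns the aggregation needs can be built like so (a sketch using the Sum values above):

import pandas as pd

df = pd.DataFrame({
    'Countries': ['India-Basic', 'Sweden-Basic', 'Norway-Basic',
                  'India-Exp', 'Sweden-Exp', 'Norway-Exp'],
    'Sum': [4000, 5000, 3000, 2500, 3500, 4500],
})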
For a regex solution, use Series.str.extract and aggregate with sum:
df1 = (df.groupby(df['Countries'].str.extract('(.*)-', expand=False), sort=False)['Sum']
         .sum()
         .reset_index())
print (df1)
Countries Sum
0 India 6500
1 Sweden 8500
2 Norway 7500
An alternative is to split Countries by - and select the first element of each list with str[0]:
df1 = (df.groupby(df['Countries'].str.split('-').str[0], sort=False)['Sum']
         .sum()
         .reset_index())
print (df1)
Countries Sum
0 India 6500
1 Sweden 8500
2 Norway 7500
This could also work; note that it only filters for the relevant columns:
(df.filter(['Countries', 'Sum'])
   .assign(Countries = lambda x: x.Countries.str.split('-').str.get(0))
   .groupby('Countries')
   .agg('sum')
)
Sum
Countries
India 6500
Norway 7500
Sweden 8500

How to add a dictionary as the last element to a list of dictionaries?

I would like to add a dictionary to a list, which contains several other dictionaries.
I have a list of ten top travel cities:
City Country Population Area
0 Buenos Aires Argentina 2891000 4758
1 Toronto Canada 2800000 2731571
2 Pyeongchang South Korea 2581000 3194
3 Marakesh Morocco 928850 200
4 Albuquerque New Mexico 559277 491
5 Los Cabos Mexico 287651 3750
6 Greenville USA 84554 68
7 Archipelago Sea Finland 60000 8300
8 Walla Walla Valley USA 32237 33
9 Salina Island Italy 4000 27
10 Solta Croatia 1700 59
11 Iguazu Falls Argentina 0 672
I imported the Excel file with pandas:
import pandas as pd
travel_df = pd.read_excel('./cities.xlsx')
print(travel_df)
cities = travel_df.to_dict('records')
print(cities)
variables = list(cities[0].keys())
I would like to add a 12th element to the end of the list but don't know how to do so:
beijing = {"City": "Beijing", "Country": "China", "Population": "24000000", "Area": "6490"}
print(beijing)
Try appending the new row to the DataFrame you read (note that append returns a new frame, so assign the result):
travel_df = travel_df.append(beijing, ignore_index=True)
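If the goal is literally to append the dictionary to the list of dictionaries, plain list.append does it. Also note that DataFrame.append was removed in pandas 2.0; a sketch of both routes, with pd.concat as the modern replacement:

# Append the dict as the last element of the list of dicts
cities.append(beijing)

# pandas >= 2.0 equivalent of DataFrame.append
import pandas as pd
travel_df = pd.concat([travel_df, pd.DataFrame([beijing])], ignore_index=True)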
