I have a data frame with data in the following format:

       Date      Status  Delhi  Maharashtra  Haryana
0  14/05/20  Identified  10000        15000    20000
1  14/05/20   Recovered   2000         3700      800
2  14/05/20    Deceased   1200         1000     1000

I have to pivot up the Status column and pivot down the state columns so the table looks like:

       Date        State  Deceased  Identified  Recovered
0  14/05/20        Delhi      1200       10000       2000
1  14/05/20      Haryana      1000       20000        800
2  14/05/20  Maharashtra      1000       15000       3700
I am trying to do it using pd.pivot_table but am unable to get the desired result.
Here is what I am trying:
table = pd.pivot_table(data = covid19_df_latest, index = ['Date', 'Delhi', 'Maharashtra', 'Haryana'], values = ['Status'], aggfunc = np.max)
print(table)
I am getting the error "No numeric types to aggregate" (pivot_table tries to aggregate the Status column, which holds strings, not numbers). Please suggest a fix.
Use DataFrame.melt with DataFrame.pivot_table:
df = (covid19_df_latest.melt(['Date','Status'], var_name='State')
                       .pivot_table(index=['Date','State'],
                                    columns='Status',
                                    values='value',
                                    aggfunc='max')
                       .reset_index()
                       .rename_axis(None, axis=1))
print(df)
Date State Deceased Identified Recovered
0 14/05/20 Delhi 1200 10000 2000
1 14/05/20 Haryana 1000 20000 800
2 14/05/20 Maharashtra 1000 15000 3700
Details: The solution first unpivots the DataFrame with melt:
print(covid19_df_latest.melt(['Date','Status'], var_name='State'))
Date Status State value
0 14/05/20 Identified Delhi 10000
1 14/05/20 Recovered Delhi 2000
2 14/05/20 Deceased Delhi 1200
3 14/05/20 Identified Maharashtra 15000
4 14/05/20 Recovered Maharashtra 3700
5 14/05/20 Deceased Maharashtra 1000
6 14/05/20 Identified Haryana 20000
7 14/05/20 Recovered Haryana 800
8 14/05/20 Deceased Haryana 1000
and then pivots it with the max aggregate function.
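Since each Date/State/Status combination occurs only once in this sample, DataFrame.pivot (no aggregation) would give the same result; a minimal sketch, assuming pandas >= 1.1 for the list-valued index:

df = (covid19_df_latest.melt(['Date','Status'], var_name='State')
                       .pivot(index=['Date','State'], columns='Status', values='value')
                       .reset_index()
                       .rename_axis(None, axis=1))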
I want to replace the NA values with the mean of the same column from the other years.
Note: To replace NA values in the Canada data, I want to use only the mean of Canada, not the mean of the whole dataset, of course.
Here's a sample dataframe filled with random numbers, with some NAs as they appear in my dataframe:
Country  Inhabitants  Year  Area  Cats  Dogs
Canada    38 000 000  2021     4    32    21
Canada    37 000 000  2020     4    NA    21
Canada    36 000 000  2019     3    32    21
Canada            NA  2018     2    32    21
Canada    34 000 000  2017    NA    32    21
Canada    35 000 000  2016     3    32    NA
Brazil   212 000 000  2021     5    32    21
Brazil   211 000 000  2020     4    NA    21
Brazil   210 000 000  2019    NA    32    21
Brazil   209 000 000  2018     4    32    21
Brazil            NA  2017     2    32    21
Brazil   207 000 000  2016     4    32    NA
What's the easiest way with pandas to replace those NAs with the mean values of the other years? And is it possible to write code that goes through every NA and replaces them all (Inhabitants, Area, Cats, Dogs) at once?
Note: This example is based on your additional data source from the comments.
To replace the NA values in multiple columns with mean(), you can combine the following three methods:
fillna() (iterating per column; axis should be 0, which is the default value of fillna())
groupby()
transform()
Create the data frame from your example:
df = pd.read_excel('https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls')
Country name  year  Life Ladder  Log GDP per capita  Social support  Healthy life expectancy at birth  Freedom to make life choices  Generosity  Perceptions of corruption  Positive affect  Negative affect
Canada        2005      7.41805             10.6518        0.961552                              71.3                      0.957306     0.25623                   0.502681         0.838544         0.233278
Canada        2007      7.48175             10.7392             nan                             71.66                      0.930341    0.249479                   0.405608         0.871604          0.25681
Canada        2008      7.4856              10.7384        0.938707                             71.84                      0.926315    0.261585                   0.369588          0.89022         0.202175
Canada        2009      7.48782             10.6972        0.942845                             72.02                      0.915058    0.246217                   0.412622         0.867433         0.247633
Canada        2010      7.65035             10.7165        0.953765                              72.2                      0.933949    0.230451                    0.41266         0.878868         0.233113
Call fillna() and fill every column with the per-country means computed by groupby and transform:
df = df.fillna(df.groupby('Country name').transform('mean'))
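On recent pandas versions, transform('mean') raises on the non-numeric columns, so the fill may need to be restricted to the numeric columns; a minimal sketch of that variant:

# Restrict the group-wise mean fill to the numeric columns only
num_cols = df.select_dtypes('number').columns
df[num_cols] = df[num_cols].fillna(df.groupby('Country name')[num_cols].transform('mean'))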
Check your result for Canada (the filled Social support value for 2007, 0.93547, is the mean over all Canada rows in the full dataset, not just the five shown above):
df[df['Country name'] == 'Canada']
Country name  year  Life Ladder  Log GDP per capita  Social support  Healthy life expectancy at birth  Freedom to make life choices  Generosity  Perceptions of corruption  Positive affect  Negative affect
Canada        2005      7.41805             10.6518        0.961552                              71.3                      0.957306     0.25623                   0.502681         0.838544         0.233278
Canada        2007      7.48175             10.7392         0.93547                             71.66                      0.930341    0.249479                   0.405608         0.871604          0.25681
Canada        2008      7.4856              10.7384        0.938707                             71.84                      0.926315    0.261585                   0.369588          0.89022         0.202175
Canada        2009      7.48782             10.6972        0.942845                             72.02                      0.915058    0.246217                   0.412622         0.867433         0.247633
Canada        2010      7.65035             10.7165        0.953765                              72.2                      0.933949    0.230451                    0.41266         0.878868         0.233113
This also works, although note that it fills with the overall column mean rather than the per-country mean:
In [2]:
df = pd.read_excel('DataPanelWHR2021C2.xls')
In [3]:
# Check for number of null values in df
df.isnull().sum()
Out [3]:
Country name 0
year 0
Life Ladder 0
Log GDP per capita 36
Social support 13
Healthy life expectancy at birth 55
Freedom to make life choices 32
Generosity 89
Perceptions of corruption 110
Positive affect 22
Negative affect 16
dtype: int64
SOLUTION
In [4]:
# Adds mean of column to any NULL values
df.fillna(df.mean(), inplace=True)
In [5]:
# 2nd check for number of null values
df.isnull().sum()
Out [5]: No more NULL values
Country name 0
year 0
Life Ladder 0
Log GDP per capita 0
Social support 0
Healthy life expectancy at birth 0
Freedom to make life choices 0
Generosity 0
Perceptions of corruption 0
Positive affect 0
Negative affect 0
dtype: int64
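On pandas >= 2.0, df.mean() raises a TypeError because of the non-numeric Country name column; a hedged variant of the same fill:

# Restrict the column means to numeric columns before filling
df.fillna(df.mean(numeric_only=True), inplace=True)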
I have a data frame with missing values:
import pandas as pd
data = {'Brand': ['residential','unclassified','tertiary','residential','unclassified','primary','residential'],
        'Price': [22000, 25000, 27000, "NA", "NA", 10000, "NA"]}
df = pd.DataFrame(data, columns=['Brand', 'Price'])
print (df)
Resulting in this data frame:
Brand Price
0 residential 22000
1 unclassified 25000
2 tertiary 27000
3 residential NA
4 unclassified NA
5 primary 10000
6 residential NA
I would like to fill in the missing values for residential and unclassified in the Price column with fixed values (residential = 1000, unclassified = 2000). However, I don't want to lose any values that are already present in the Price column for residential or unclassified, so the output should look like this:
Brand Price
0 residential 22000
1 unclassified 25000
2 tertiary 27000
3 residential 1000
4 unclassified 2000
5 primary 10000
6 residential 1000
What's the easiest way to get this done?
We can use map with fillna. PS: you need to make sure that the missing entries in your df are real NaN, not the string "NA" (see the coercion sketch after the output below):
df.Price.fillna(df.Brand.map({'residential': 1000, 'unclassified': 2000}), inplace=True)
df
Brand Price
0 residential 22000.0
1 unclassified 25000.0
2 tertiary 27000.0
3 residential 1000.0
4 unclassified 2000.0
5 primary 10000.0
6 residential 1000.0
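Since the question's sample data stores the missing prices as the string "NA", a minimal sketch of that coercion step (pd.to_numeric turns unparseable entries into NaN):

import pandas as pd

# Coerce the string "NA" entries to real NaN, then fill per brand
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
df['Price'] = df['Price'].fillna(df['Brand'].map({'residential': 1000, 'unclassified': 2000}))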
I have a dataframe where I need to group by country and sum all the data.
I have used a regex to find and group all the rows that start with the respective country names.
Suppose I have this dataset:
Countries 31-12-17 1-1-18 2-1-18 3-1-18 Sum
India-Basic 1200 1100 800 900 4000
Sweden-Basic 1500 1300 700 1500 5000
Norway-Basic 800 400 900 900 3000
India-Exp 600 1400 300 200 2500
Sweden-Exp 1800 400 600 700 3500
Norway-Exp 1300 1600 1100 1500 4500
Expected Output :
Countries Sum
India 6500
Sweden 8500
Norway 7500
For a regex solution use Series.str.extract, then aggregate with sum:
df1 = (df.groupby(df['Countries'].str.extract('(.*)-', expand=False), sort=False)['Sum']
.sum()
.reset_index())
print (df1)
Countries Sum
0 India 6500
1 Sweden 8500
2 Norway 7500
An alternative is to split Countries by - and select the first list element with str[0]:
df1 = (df.groupby(df['Countries'].str.split('-').str[0], sort=False)['Sum']
.sum()
.reset_index())
print (df1)
Countries Sum
0 India 6500
1 Sweden 8500
2 Norway 7500
This could also work; note that I only kept the relevant columns, and that this groupby sorts the countries alphabetically:
(df.filter(['Countries','Sum'])
.assign(Countries = lambda x: x.Countries.str.split('-').str.get(0))
.groupby('Countries')
.agg('sum')
)
Sum
Countries
India 6500
Norway 7500
Sweden 8500
I have a data set with Student Names, the date of transaction and the amount.
Each student has made multiple transactions.
I want to calculate current month rank and previous month rank based on total amount for each student.
I am able to do a group by Student Name to calculate the total amount for each student using:
transactions['Totals'] = transactions.groupby('Student Name')['Sale Amount'].transform('sum')
How do I extend this to make two different columns that calculate previous month totals and current month totals for each student, so I can assign previous month and current month ranks to them?
The date is in the following format:
09/05/2015 04:18 PM
07/15/2019 09:50 AM
05/18/2018 02:34 PM
08/11/2018 06:29 PM
06/14/2018 07:42 AM
EDIT: Adding dataframe for reference:
Out[15]:
Date of Transaction Student Name Sale Amount
0 09/05/2015 04:18 PM Dan Kelly 4333
1 07/15/2019 09:50 AM Peter Dyer 8805
2 05/18/2018 02:34 PM Natalie Robertson 5640
3 08/11/2018 06:29 PM Sean Miller 6485
4 06/14/2018 07:42 AM Thomas Forsyth 6815
... ... ...
9977 03/15/2018 09:28 PM Grace Vance 6379
9978 08/07/2019 11:14 PM Alexandra Cameron 6688
9979 01/09/2015 10:53 AM Sebastian Vaughan 2262
9980 05/19/2019 10:00 PM Caroline Blake 6977
9981 01/11/2016 04:05 AM Austin Edmunds 3205
[9982 rows x 3 columns]
EDIT: Adding sample expected output:
I've created a dataframe with the minimal data you provided: 'Student Name', 'Sale Amount', 'Date'.
My dataframe:
df = pd.DataFrame([['12/05/2019 04:18 PM','Marisa',500],
['11/29/2019 04:18 PM','Marisa',500],
['11/20/2019 04:18 PM','Marisa',800],
['12/04/2019 04:18 PM','Peter',300],
['11/30/2019 04:18 PM','Peter',300],
['12/05/2019 04:18 PM','Debra',400],
['11/28/2019 04:18 PM','Debra',200],
['11/15/2019 04:18 PM','Debra',600],
['10/23/2019 04:18 PM','Debra',200]],columns=['Date','Student Name','Sale Amount']
)
Make sure Date is a datetime column:
df.Date = pd.to_datetime(df.Date)
This gives you the total amount per month per student in the original dataframe:
df['Total'] = df.groupby(['Student Name',pd.Grouper(key='Date', freq='1M')])['Sale Amount'].transform('sum')
Date Student Name Sale Amount Total
0 2019-12-05 16:18:00 Marisa 500 500
1 2019-11-29 16:18:00 Marisa 500 1300
2 2019-11-20 16:18:00 Marisa 800 1300
3 2019-12-04 16:18:00 Peter 300 300
4 2019-11-30 16:18:00 Peter 300 300
5 2019-12-05 16:18:00 Debra 400 400
6 2019-11-28 16:18:00 Debra 200 800
7 2019-11-15 16:18:00 Debra 600 800
8 2019-10-23 16:18:00 Debra 200 200
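If you prefer a period key over pd.Grouper, an equivalent grouping (an alternative sketch, not the method above) is:

# Group by student and calendar month via a monthly Period key
df['Total'] = (df.groupby(['Student Name', df['Date'].dt.to_period('M')])['Sale Amount']
                 .transform('sum'))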
How to print only the selected results?
Work on a copy of df, called dnew:
dnew = df.copy()  # .copy() so the original df is left untouched
Let's strip datetime to keep months only:
#Strip date to month
dnew['Date'] = dnew['Date'].apply(lambda x:x.date().strftime('%m'))
Drop Sale Amount entries and group by Student Name and Date (new dataframe is "sales"):
#Drop Sale Amount
sales = dnew.drop(['Sale Amount'], axis=1).groupby(['Student Name','Date'])['Total'].max()
print(sales)
Student Name Date
Debra 10 200
11 800
12 400
Marisa 11 1300
12 500
Peter 11 300
12 300
Actually, "sales" is pandas.core.series.Series and it's important to know that
print(sales.index)
MultiIndex([( 'Debra', '10'),
( 'Debra', '11'),
( 'Debra', '12'),
('Marisa', '11'),
('Marisa', '12'),
( 'Peter', '11'),
( 'Peter', '12')],
names=['Student Name', 'Date'])
from datetime import datetime
curMonth = int(datetime.today().strftime('%m')) #transform to integer to perform (curMonth-1)
#12
#months of interest (zero-padded to match the '%m' strings in the index)
moi = sales.iloc[(sales.index.get_level_values('Date') == f'{curMonth - 1:02d}') |
                 (sales.index.get_level_values('Date') == f'{curMonth:02d}')]
print(moi)
Student Name Date
Debra 11 800
12 400
Marisa 11 1300
12 500
Peter 11 300
12 300
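To get from these monthly totals to the ranks the question actually asks for, a minimal sketch using GroupBy.rank on the moi series above (dense ranking with the highest total as rank 1 is an assumption about the desired tie handling):

# Rank students within each month by their monthly total
ranks = moi.groupby(level='Date').rank(ascending=False, method='dense')
print(ranks)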
I have a dataframe of mean housing prices by month that looks like this:
RegionName 2000-01 2000-02 2000-03
New York 200000 210000 220000
Austin 100000 110000 130000 ...
Los Angeles 180000 190000 200000
I have a list of lists of months corresponding to quarters and a list of quarters that look like
month_chunks = [['2000-01', '2000-02', '2000-03'], ['2000-04', '2000-05', '2000-06']...]
quarters = ['2000q1', '2000q2', '2000q3'...]
I'm trying to create columns in my dataframe that contain mean prices by quarter:
for quarter, chunk in zip(quarters, month_chunks):
    housing[quarter] = np.mean(housing[chunk].mean())
RegionName 2000-01 2000-02 2000-03 2000q1
New York 200000 210000 220000 210000
Austin 100000 110000 130000 ... 113333.333
Los Angeles 180000 190000 200000 190000
But it is giving me columns that are duplicated for each row
RegionName 2000-01 2000-02 2000-03 2000q1
New York 200000 210000 220000 210000
Austin 100000 110000 130000 ... 210000
Los Angeles 180000 190000 200000 210000
The dataframe is large, so iterating through it and the lists is not doable:
for i, row in housing.iterrows():
    for quarter, chunk in zip(quarters, month_chunks):
        row[quarter].iloc[i] = np.mean(row[chunk].iloc[i].mean())
Don't use iterrows; you can perform your operation column-wise (mean(axis=1) averages across the month columns of each row):
for months, qt in zip(month_chunks, quarters):
    housing[qt] = housing[months].mean(axis=1)
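The same thing without the explicit loop, as a small sketch using a dict comprehension (assuming every month column listed in month_chunks exists in housing):

housing = housing.assign(**{qt: housing[months].mean(axis=1)
                            for months, qt in zip(month_chunks, quarters)})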
Here is one way using groupby:
from collections import ChainMap

# d maps each month column to its quarter label, e.g. {'2000-01': '2000q1', ...}
d = dict(ChainMap(*[dict.fromkeys(x, y) for x, y in zip(month_chunks, quarters)]))
s = housing.set_index('RegionName').groupby(d, axis=1).mean()
s
Out[32]:
2000q1
RegionName
NewYork 210000.000000
Austin 113333.333333
LosAngeles 190000.000000
df = pd.concat([housing.set_index('RegionName'), s], axis=1)
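groupby(..., axis=1) is deprecated in recent pandas; a hedged equivalent transposes, groups the month labels on the index, and transposes back:

s = housing.set_index('RegionName').T.groupby(d).mean().T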