How to take the mean across a row in a pandas pivot table DataFrame?

I have a pandas dataframe as seen below which is a pivot table. I would like to print Africa in 2007 as well as do the mean of the entire Americas row; any ideas how to do this? I have been doing combinations of stack/unstack for a while now to no avail.
year       1952  1957  1962  1967  1972  1977  1982  1987  1992  1997  2002  2007
continent
Africa       12    13    15    20    39    25    81    12    22    23    25    44
Americas     12    14    65    10   119    15    21    42    47    84    15    89
Asia         12    13    89    20    39    25    81    29    77    23    25    89
Europe       12    13    15    20    39    25    81    29    23    32    15    89
Oceania      12    13    15    20    39    25    81    27    32    85    25    89

import pandas as pd
df = pd.read_csv('dummy_data.csv')
# handy to see the continent name against the value rather than '0' or '3'
df.set_index('continent', inplace=True)
# print mean for all rows - see how the continent name helps here
print(df.mean(axis=1))
print('---')
print()
# print the mean for just the 'Americas' row
print(df.mean(axis=1)['Americas'])
print('---')
print()
# print the value of the 'Africa' row for the year (column) 2007
print(df.query('continent == "Africa"')['2007'])
print('---')
print()
Output:
continent
Africa 27.583333
Americas 44.416667
Asia 43.500000
Europe 32.750000
Oceania 38.583333
dtype: float64
---
44.416666666666664
---
continent
Africa 44
Name: 2007, dtype: int64
---
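As a complement to the answer above, here is a minimal sketch of the same two lookups using label-based indexing with .loc. It assumes, as in the answer, that continent is the index and that the year columns are strings; the tiny two-column frame is a hypothetical stand-in for the real pivot table.

```python
import pandas as pd

# Hypothetical stand-in for the pivot table in the question (two years only).
df = pd.DataFrame(
    {"1952": [12, 12], "2007": [44, 89]},
    index=pd.Index(["Africa", "Americas"], name="continent"),
)

# Scalar lookup: row label first, then column label.
africa_2007 = df.loc["Africa", "2007"]

# Mean of one row only, without computing every row's mean first.
americas_mean = df.loc["Americas"].mean()

print(africa_2007)
print(americas_mean)
```

If the year columns were read as integers rather than strings, the column label would be 2007 instead of '2007'.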

Related

Web Scraping School Project

I need help with a school project. The lines I have commented out with "#" don't work on the table I scraped. I need to turn the table into a data frame. Can anyone see what I'm getting wrong, or whether I'm missing a step?
Tertiary=pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_tertiary_education_attainment")
Tertiary=pd.DataFrame(Tertiary[1])
#Tertiary=Tertiary.drop(["Non-OECD"], axis=1, inplace=True)
print(Tertiary.dtypes)
#Tertiary["Age25-64(%)"] = pd.to_numeric(Tertiary["Age25-64(%)"])
#Tertiary["Age"] = pd.to_numeric(Tertiary["Age"])
print(Tertiary.dtypes)
print()
#print(Tertiary.describe)
print()
#print(Tertiary.isnull().sum())
#print(Tertiary)
Everything works fine for me.
import pandas as pd
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_tertiary_education_attainment")
table = pd.DataFrame(df[1])
print(table)
print(table.columns)
Output:
Country Age 25–64 (%) Age Year Non-OECD
Country Age 25–64 (%) 25–34 (%) 35–44 (%) 45–54 (%) 55–64 (%) Year Non-OECD
0 Australia 42 48 46 38 33 2014 NaN
1 Austria 30 38 33 27 21 2014 NaN
2 Belgium 37 44 42 34 26 2014 NaN
3 Brazil 14 15 14 14 11 2013 NaN
4 Canada 54 58 61 51 45 2014 NaN
5 Chile 21 27 24 17 14 2013 NaN
6 China 17 27 15 7 2 2018 NaN
7 Colombia 22 28 23 18 16 2014 NaN
8 Costa Rica 18 21 19 17 17 2014 NaN
9 Czech Republic 22 30 21 20 15 2014 NaN
10 Denmark 36 42 41 33 29 2014 NaN
11 Estonia 38 40 39 35 36 2014 NaN
12 Finland 42 40 50 44 34 2014 NaN
13 France 32 44 39 26 20 2013 NaN
14 Germany 27 28 29 26 25 2014 NaN
15 Greece 28 39 27 26 21 2014 NaN
16 Hungary 23 32 25 20 17 2014 NaN
17 Iceland 37 41 42 36 29 2014 NaN
18 Indonesia 8 10 9 8 4 2011 NaN
19 Ireland 41 51 49 34 24 2014 NaN
20 Israel 49 46 53 48 47 2014 NaN
21 Italy 17 24 19 13 12 2014 NaN
22 Japan 48 59 53 47 35 2014 NaN
23 Latvia 30 39 31 27 23 2014 NaN
24 Lithuania 37 53 38 30 28 2014 NaN
25 Luxembourg 46 53 56 40 32 2014 NaN
26 Mexico 19 25 17 16 13 2014 NaN
27 Netherlands 34 44 38 30 27 2014 NaN
28 New Zealand 36 40 41 32 29 2014 NaN
29 Norway 42 49 49 36 32 2014 NaN
30 Poland 27 43 32 18 14 2014 NaN
31 Portugal 22 31 26 17 13 2014 NaN
32 Russia 54 58 55 53 50 2013 NaN
33 Saudi Arabia 22 26 22 18 14 2013 NaN
34 Slovakia 20 30 21 15 14 2014 NaN
35 Slovenia 29 38 35 24 18 2014 NaN
36 South Africa 7 5 7 8 7 2012 NaN
37 South Korea 45 68 56 33 17 2014 NaN
38 Spain 35 41 43 30 21 2014 NaN
39 Sweden 39 46 46 32 30 2014 NaN
40 Switzerland 40 46 45 38 31 2014 NaN
41 Turkey 17 25 16 10 10 2014 NaN
42 Taiwan[3] 45 X X X X 2015 NaN
43 United Kingdom 42 49 46 38 35 2014 NaN
44 United States 44 46 47 43 41 2014 NaN
__
MultiIndex([( 'Country', 'Country'),
('Age 25–64 (%)', 'Age 25–64 (%)'),
( 'Age', '25–34 (%)'),
( 'Age', '35–44 (%)'),
( 'Age', '45–54 (%)'),
( 'Age', '55–64 (%)'),
( 'Year', 'Year'),
( 'Non-OECD', 'Non-OECD')],
)
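The printed columns suggest why the commented-out lines in the question may fail: the header has two levels, so a plain name such as "Non-OECD" does not address a single column, and assigning the result of drop(..., inplace=True) replaces the frame with None. A minimal sketch of one way around this, using a small hand-built frame with the same header shape instead of the live page (the two sample rows and the reduced column set are assumptions for illustration):

```python
import pandas as pd

# Small stand-in for the scraped table (same two-level header shape).
columns = pd.MultiIndex.from_tuples([
    ("Country", "Country"),
    ("Age 25–64 (%)", "Age 25–64 (%)"),
    ("Age", "25–34 (%)"),
    ("Year", "Year"),
    ("Non-OECD", "Non-OECD"),
])
table = pd.DataFrame(
    [["Australia", 42, "48", 2014, None],
     ["Taiwan[3]", 45, "X", 2015, None]],
    columns=columns,
)

# Keep only the second header level so each column has a single name.
flat = table.copy()
flat.columns = flat.columns.get_level_values(1)

# drop returns a new frame; do not combine the assignment with inplace=True.
flat = flat.drop(columns=["Non-OECD"])

# Coerce to numbers; non-numeric placeholders such as "X" become NaN.
flat["25–34 (%)"] = pd.to_numeric(flat["25–34 (%)"], errors="coerce")
print(flat.dtypes)
```

Using errors="coerce" is a design choice here: it keeps the Taiwan row but marks its missing age-band values as NaN rather than raising.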

How to transpose or pivot a table? Selecting specific columns

beginner here!
I have a dataframe similar to this:
df = pd.DataFrame({'Country_Code':['FR','FR','FR','USA','USA','USA','BR','BR','BR'],'Indicator_Name':['GPD','Pop','birth','GPD','Pop','birth','GPD','Pop','birth'],'2005':[14,34,56, 25, 67, 68, 55, 8,99], '2006':[23, 34, 34, 43,34,34, 65, 34,45]})
Index Country_Code Indicator_Name 2005 2006
0 FR GPD 14 23
1 FR Pop 34 34
2 FR birth 56 34
3 USA GPD 25 43
4 USA Pop 67 34
5 USA birth 68 34
6 BR GPD 55 65
7 BR Pop 8 34
8 BR birth 99 45
I need to pivot or transpose it, keeping the country code, the year, and the indicator names as columns, like this:
index Country_Code year GPD Pop birth
0 FR 2005 14 34 56
1 FR 2006 23 34 34
3 USA 2005 25 67 68
4 USA 2006 43 34 34
...
I used the transposed function like this:
df.set_index(['Indicator_Name']).transpose()
The result is close, but the countries end up as a row, like this:
Indicator_Name GPD Pop birth GPD Pop birth GPD Pop birth
Country_Code FR FR FR USA USA USA BR BR BR
2005 14 34 56 25 67 68 55 8 99
2006 23 34 34 43 34 34 65 34 45
I also tried to use the "pivot" and the "pivot table" function, but the result is not satisfactory. Could you please give me some advice?
import pandas as pd
df = pd.DataFrame({'Country_Code':['FR','FR','FR','USA','USA','USA','BR','BR','BR'],'Indicator_Name':['GPD','Pop','birth','GPD','Pop','birth','GPD','Pop','birth'],'2005':[14,34,56, 25, 67, 68, 55, 8,99], '2006':[23, 34, 34, 43,34,34, 65, 34,45]})
df
#%% Pivot longer: columns `'2005'` and `'2006'` become `'Year'`
df1 = df.melt(id_vars=["Country_Code", "Indicator_Name"],
              var_name="Year",
              value_name="Value")
#%% Pivot wider by the values in `'Indicator_Name'`
df2 = (df1.pivot_table(index=['Country_Code', 'Year'],
                       columns=['Indicator_Name'],
                       values=['Value'],
                       aggfunc='first'))
Output:
Value
Indicator_Name GPD Pop birth
Country_Code Year
BR 2005 55 8 99
2006 65 34 45
FR 2005 14 34 56
2006 23 34 34
USA 2005 25 67 68
2006 43 34 34
The simplest approach, in my opinion, is to pivot and then stack:
(df.pivot(index='Country_Code', columns='Indicator_Name')
   .rename_axis(columns=['year', None]).stack(0).reset_index()
)
output:
Country_Code year GPD Pop birth
0 BR 2005 55 8 99
1 BR 2006 65 34 45
2 FR 2005 14 34 56
3 FR 2006 23 34 34
4 USA 2005 25 67 68
5 USA 2006 43 34 34
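The melt-then-pivot route can also end in the flat layout the question asks for. A minimal sketch, assuming each (Country_Code, Year, Indicator_Name) combination occurs only once so that pivot needs no aggregation, and a pandas version whose pivot accepts a list for index:

```python
import pandas as pd

df = pd.DataFrame({
    'Country_Code': ['FR', 'FR', 'FR', 'USA', 'USA', 'USA', 'BR', 'BR', 'BR'],
    'Indicator_Name': ['GPD', 'Pop', 'birth'] * 3,
    '2005': [14, 34, 56, 25, 67, 68, 55, 8, 99],
    '2006': [23, 34, 34, 43, 34, 34, 65, 34, 45],
})

# Long format: one row per country, indicator and year.
long_df = df.melt(id_vars=['Country_Code', 'Indicator_Name'],
                  var_name='Year', value_name='Value')

# Wide by indicator; passing values as a plain string (not a list)
# keeps the resulting columns single-level.
flat = (long_df.pivot(index=['Country_Code', 'Year'],
                      columns='Indicator_Name',
                      values='Value')
               .reset_index()
               .rename_axis(columns=None))
print(flat)
```

The final rename_axis only removes the leftover "Indicator_Name" label from the column axis; the data is unchanged.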

Importing Excel data with merging cells

How can we import Excel data that contains merged cells?
Please see the image of the Excel sheet.
The last column has 3 sub-columns. How can we import it without changing the Excel sheet?
You could try this
import pandas as pd
# Store the file name in a variable
dataset = 'Merged_Column_Data.xlsx'
# Import the dataset and skip row 1 (the row holding the merged header cell)
df = pd.read_excel(dataset, skiprows=1)
Unnamed: 0 Unnamed: 1 Unnamed: 2 Gold Silver Bronze
0 Great Britain GBR 2012 29 17 19
1 China CHN 2012 38 28 22
2 Russia RUS 2012 24 25 32
3 United States US 2012 46 28 29
4 Korea KOR 2012 13 8 7
# Create dictionary to handle unnamed columns
col_dict = {'Unnamed: 0':'Country', 'Unnamed: 1':'Country',
'Unnamed: 2':'Year',}
# Rename columns with dictionary
df.rename(columns=col_dict)
Country Country Year Gold Silver Bronze
0 Great Britain GBR 2012 29 17 19
1 China CHN 2012 38 28 22
2 Russia RUS 2012 24 25 32
3 United States US 2012 46 28 29
4 Korea KOR 2012 13 8 7

Reshape dataframe with multiindexed column headers from wide to long

I'd like to reshape a pandas dataframe from wide to long. The challenge lies in the fact that the columns have got multiindexed column headers. The dataframe looks like this:
category price1 price2
year 2011 2012 2013 2011 2012 2013
1 33 22 48 135 144 149
2 22 26 37 136 127 129
3 39 30 47 123 148 148
4 45 42 21 140 126 121
5 20 37 35 141 142 147
6 29 20 34 122 121 132
7 20 35 45 128 123 130
8 39 34 49 125 120 131
9 24 20 36 122 146 130
10 24 37 43 142 133 138
11 23 22 40 124 135 131
12 27 22 40 147 149 132
Below is a snippet that produces the very same dataframe. You will also see that I've built this dataframe by concatenating two other dataframes.
Here's the snippet:
import pandas as pd
import numpy as np
# Make dataframe df1 with 12 observations over 3 years
# with multiindexed column headers
np.random.seed(123)
df1 = pd.DataFrame(np.random.randint(20, 50, size = (12,3)), columns=[2011,2012,2013])
df1.index = np.arange(1,len(df1)+1)
colNames1 = df1.columns
header1 = pd.MultiIndex.from_product([['price1'], colNames1], names=['category','year'])
df1.columns = header1
# Make dataframe df2 with 12 observations over 3 years
# with multiindexed column headers
df2 = pd.DataFrame(np.random.randint(120, 150, size = (12,3)), columns=[2011,2012,2013])
df2.index = np.arange(1,len(df2)+1)
colNames1 = df2.columns
header1 = pd.MultiIndex.from_product([['price2'], colNames1], names=['category','year'])
df2.columns = header1
df3 = pd.concat([df1, df2], axis = 1)
And here is the desired output:
price1 price2
1 2011 33 135
2 2011 22 136
3 2011 39 123
4 2011 45 140
5 2011 20 141
6 2011 29 122
7 2011 20 128
8 2011 39 125
9 2011 24 122
10 2011 24 142
11 2011 23 124
12 2011 27 147
1 2012 22 144
2 2012 26 127
3 2012 30 148
4 2012 42 126
5 2012 37 142
6 2012 20 121
7 2012 35 123
8 2012 34 120
9 2012 20 146
10 2012 37 133
11 2012 22 135
12 2012 22 149
1 2013 48 149
2 2013 37 129
3 2013 47 148
4 2013 21 121
5 2013 35 147
6 2013 34 132
7 2013 45 130
8 2013 49 131
9 2013 36 130
10 2013 43 138
11 2013 40 131
12 2013 40 132
I've tried different solutions based on suggestions with Reshape and pandas.wide_to_long, but I'm struggling with the multiindexed column names. So why not just remove this? Mostly because this is what my real world problem will look like, and also because I refuse to believe that it can't be done.
Thank you for any suggestions!
Use stack to move the last column level (year) into the index, then sort_index; add rename_axis and reset_index to turn the index levels into columns:
df3 = (df3.stack()
          .sort_index(level=[1,0])
          .rename_axis(['months','year'])
          .reset_index()
          .rename_axis(None, axis=1))
print (df3.head(15))
months year price1 price2
0 1 2011 33 135
1 2 2011 22 136
2 3 2011 39 123
3 4 2011 45 140
4 5 2011 20 141
5 6 2011 29 122
6 7 2011 20 128
7 8 2011 39 125
8 9 2011 24 122
9 10 2011 24 142
10 11 2011 23 124
11 12 2011 27 147
12 1 2012 22 144
13 2 2012 26 127
14 3 2012 30 148
If you need to keep the MultiIndex:
df3 = df3.stack().sort_index(level=[1,0])
print (df3.head(15))
category price1 price2
year
1 2011 33 135
2 2011 22 136
3 2011 39 123
4 2011 45 140
5 2011 20 141
6 2011 29 122
7 2011 20 128
8 2011 39 125
9 2011 24 122
10 2011 24 142
11 2011 23 124
12 2011 27 147
1 2012 22 144
2 2012 26 127
3 2012 30 148
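The same reshape can be written by naming the level to stack instead of relying on it being the last one. This is a minimal sketch on a hypothetical two-row, two-year frame with the same column structure as df3; the name wide and the shortened data are assumptions for illustration.

```python
import pandas as pd

# Two-level column header: category on top, year below.
cols = pd.MultiIndex.from_product(
    [["price1", "price2"], [2011, 2012]], names=["category", "year"]
)
wide = pd.DataFrame(
    [[33, 22, 135, 144], [22, 26, 136, 127]], index=[1, 2], columns=cols
)

# Move the "year" level from the columns into the row index,
# order by year and then by the original row label, and flatten.
long = (
    wide.stack("year")
        .sort_index(level=[1, 0])
        .rename_axis(index=["months", "year"], columns=None)
        .reset_index()
)
print(long)
```

Stacking by name keeps working if the order of the column levels changes, which may matter when the frame is built by concatenation as in the question.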

Pandas: transform column's values in independent columns

I have a pandas DataFrame which looks like the following (df_olympic).
I would like the values of the column Type to be transformed into independent columns (df_olympic_table).
Original dataframe
In [3]: df_olympic
Out[3]:
Country Type Num
0 USA Gold 46
1 USA Silver 37
2 USA Bronze 38
3 GB Gold 27
4 GB Silver 23
5 GB Bronze 17
6 China Gold 26
7 China Silver 18
8 China Bronze 26
9 Russia Gold 19
10 Russia Silver 18
11 Russia Bronze 19
Transformed dataframe
In [5]: df_olympic_table
Out[5]:
Country N_Gold N_Silver N_Bronze
0 USA 46 37 38
1 GB 27 23 17
2 China 26 18 26
3 Russia 19 18 19
What would be the most convenient way to achieve this?
You can use DataFrame.pivot:
df = df.pivot(index='Country', columns='Type', values='Num')
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 19 19 18
USA 38 46 37
Another solution with DataFrame.set_index and Series.unstack:
df = df.set_index(['Country','Type'])['Num'].unstack()
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 19 19 18
USA 38 46 37
If you get this error instead:
ValueError: Index contains duplicate entries, cannot reshape
you need pivot_table with an aggregate function. The default is the mean, but you can also use sum, first, and so on.
#add new row with duplicates value in 'Country' and 'Type'
print (df)
Country Type Num
0 USA Gold 46
1 USA Silver 37
2 USA Bronze 38
3 GB Gold 27
4 GB Silver 23
5 GB Bronze 17
6 China Gold 26
7 China Silver 18
8 China Bronze 26
9 Russia Gold 19
10 Russia Silver 18
11 Russia Bronze 20 < - changed value to 20
11 Russia Bronze 100 < - add new row with duplicates
df = df.pivot_table(index='Country', columns='Type', values='Num', aggfunc='mean')
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 60 19 18 < - Russia gets (100 + 20) / 2 = 60
USA 38 46 37
Or use groupby, aggregating with mean, and reshape with unstack:
df = df.groupby(['Country','Type'])['Num'].mean().unstack()
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 60 19 18 < - Russia gets (100 + 20) / 2 = 60
USA 38 46 37
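To get closer to the layout requested in the question (an N_ prefix, Country as a regular column, and the original row and column order), the pivot result can be post-processed. A minimal sketch on a shortened copy of the data, assuming there are no duplicate Country/Type pairs:

```python
import pandas as pd

df_olympic = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "GB", "GB", "GB"],
    "Type": ["Gold", "Silver", "Bronze"] * 2,
    "Num": [46, 37, 38, 27, 23, 17],
})

table = (
    df_olympic.pivot(index="Country", columns="Type", values="Num")
    # pivot sorts labels alphabetically; restore the order of appearance.
    .reindex(index=df_olympic["Country"].unique(),
             columns=["Gold", "Silver", "Bronze"])
    .add_prefix("N_")
    # Name the row axis explicitly and clear the "Type" column-axis label.
    .rename_axis(index="Country", columns=None)
    .reset_index()
)
print(table)
```

With duplicate pairs, the pivot call would need to be replaced by pivot_table or groupby, as described above.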
