Pandas: transform column's values in independent columns - python

I have a Pandas DataFrame that looks like the following (df_olympic).
I would like the values of the column Type to be turned into independent columns (df_olympic_table).
Original dataframe
In [3]: df_olympic
Out[3]:
Country Type Num
0 USA Gold 46
1 USA Silver 37
2 USA Bronze 38
3 GB Gold 27
4 GB Silver 23
5 GB Bronze 17
6 China Gold 26
7 China Silver 18
8 China Bronze 26
9 Russia Gold 19
10 Russia Silver 18
11 Russia Bronze 19
Transformed dataframe
In [5]: df_olympic_table
Out[5]:
Country N_Gold N_Silver N_Bronze
0 USA 46 37 38
1 GB 27 23 17
2 China 26 18 26
3 Russia 19 18 19
What would be the most convenient way to achieve this?

You can use DataFrame.pivot:
df = df.pivot(index='Country', columns='Type', values='Num')
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 19 19 18
USA 38 46 37
Another solution with DataFrame.set_index and Series.unstack:
df = df.set_index(['Country','Type'])['Num'].unstack()
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 19 19 18
USA 38 46 37
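Both results above still have Country as the index and plain medal names as columns. A minimal sketch of one way to get the exact layout asked for in the question (N_ prefixes, Country as a regular column, original row order) follows. The input frame is rebuilt here from the question's data, and the chain after pivot is an illustrative addition, not part of the original answer.

```python
import pandas as pd

# Rebuild the question's data so the example is self-contained
df_olympic = pd.DataFrame({
    "Country": ["USA"] * 3 + ["GB"] * 3 + ["China"] * 3 + ["Russia"] * 3,
    "Type": ["Gold", "Silver", "Bronze"] * 4,
    "Num": [46, 37, 38, 27, 23, 17, 26, 18, 26, 19, 18, 19],
})

df_olympic_table = (
    df_olympic.pivot(index="Country", columns="Type", values="Num")
    # restore the original country order and the desired medal order
    .reindex(index=df_olympic["Country"].unique(),
             columns=["Gold", "Silver", "Bronze"])
    # Gold -> N_Gold, Silver -> N_Silver, Bronze -> N_Bronze
    .add_prefix("N_")
    # name the index explicitly and drop the leftover "Type" columns label
    .rename_axis(index="Country", columns=None)
    # turn Country back into an ordinary column
    .reset_index()
)
print(df_olympic_table)
```

The reindex step is only needed if the row and column order matters, because pivot sorts both alphabetically.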
But if you get:
ValueError: Index contains duplicate entries, cannot reshape
you need pivot_table with an aggregate function. By default it is np.mean, but you can use sum, first...
# add a new row that duplicates an existing 'Country' and 'Type' pair
print (df)
Country Type Num
0 USA Gold 46
1 USA Silver 37
2 USA Bronze 38
3 GB Gold 27
4 GB Silver 23
5 GB Bronze 17
6 China Gold 26
7 China Silver 18
8 China Bronze 26
9 Russia Gold 19
10 Russia Silver 18
11 Russia Bronze 20 <- value changed to 20
12 Russia Bronze 100 <- new row duplicating an existing Country and Type
df = df.pivot_table(index='Country', columns='Type', values='Num', aggfunc=np.mean)
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 60 19 18 <- Russia gets (100 + 20) / 2 = 60
USA 38 46 37
Or use groupby, aggregating with mean, and reshape with unstack:
df = df.groupby(['Country','Type'])['Num'].mean().unstack()
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 60 19 18 <- Russia gets (100 + 20) / 2 = 60
USA 38 46 37
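The answer mentions that sum or first can be used instead of the mean. A small sketch of how those choices change the result follows. It uses only the Russia rows from the example above, and the variable names (dupes, total, first) are made up for illustration.

```python
import pandas as pd

# Only the Russia rows, including the duplicated (Russia, Bronze) pair
df = pd.DataFrame({
    "Country": ["Russia", "Russia", "Russia", "Russia"],
    "Type": ["Gold", "Silver", "Bronze", "Bronze"],
    "Num": [19, 18, 20, 100],
})

# Show every row whose (Country, Type) pair occurs more than once
dupes = df[df.duplicated(subset=["Country", "Type"], keep=False)]
print(dupes)

# Add the duplicates together
total = df.pivot_table(index="Country", columns="Type", values="Num", aggfunc="sum")

# Keep only the first value seen for each pair
first = df.pivot_table(index="Country", columns="Type", values="Num", aggfunc="first")

print(total)
print(first)
```

Inspecting the duplicated rows first helps decide which aggregation is meaningful for the data, since mean, sum and first can give very different numbers.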

Related

Convert columns into rows data with Pandas

My dataset has some information by location for n dates. The CSV looks like
Country year2018 year2019 year2020
saleA saleB SaleA SaleB saleA saleB
USA 22 23 323 32 31 65
china 12 12 2 66 66 78
I want my data to be of the form
Country year saleA saleB
USA year2018 22 23
USA year2019 323 32
USA year2020 31 65
china year2018 12 12
.
.
.
How can I do it using pandas?
I tried using pd.melt but couldn't figure it out.
You can reshape your dataframe with set_index and stack:
out = (df.set_index('Country')
.rename_axis(columns=['year', None])
.stack('year').reset_index())
Country year saleA saleB
0 USA year2018 22 23
1 USA year2019 323 32
2 USA year2020 31 65
3 China year2018 12 12
4 China year2019 2 66
5 China year2020 66 78
Another solution with melt and pivot_table:
out = (df.melt(id_vars='Country', var_name=['year', 'sale'])
.pivot_table(index=['Country', 'year'], columns='sale', values='value')
.reset_index())
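To try the stack approach without the original CSV, here is a minimal, self-contained sketch. It assumes the file has already been loaded with two header levels, with Country as the index, and that the sale labels are spelled consistently (saleA, saleB) under every year. How the real file is read is not shown in the question, so the construction of df below is an assumption.

```python
import pandas as pd

# Assumed in-memory equivalent of the CSV: years on the first column level,
# sale types on the second, and Country as the index
columns = pd.MultiIndex.from_product(
    [["year2018", "year2019", "year2020"], ["saleA", "saleB"]]
)
df = pd.DataFrame(
    [[22, 23, 323, 32, 31, 65],
     [12, 12, 2, 66, 66, 78]],
    index=pd.Index(["USA", "china"], name="Country"),
    columns=columns,
)

out = (
    df.rename_axis(columns=["year", None])  # name the first column level "year"
      .stack("year")                        # move the year level into the rows
      .reset_index()                        # Country and year become columns
)
print(out)
```

Because Country is already the index here, the set_index('Country') step from the answer is not needed in this sketch.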

Web Scraping School Project

I need help with a school project. I can't get the lines I have commented out with "#" to work with the table I scraped. I need to turn the table into a data frame. Can anyone see what I'm missing, or whether I am missing a step?
Tertiary=pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_tertiary_education_attainment")
Tertiary=pd.DataFrame(Tertiary[1])
#Tertiary=Tertiary.drop(["Non-OECD"], axis=1, inplace=True)
print(Tertiary.dtypes)
#Tertiary["Age25-64(%)"] = pd.to_numeric(Tertiary["Age25-64(%)"])
#Tertiary["Age"] = pd.to_numeric(Tertiary["Age"])
print(Tertiary.dtypes)
print()
#print(Tertiary.describe)
print()
#print(Tertiary.isnull().sum())
#print(Tertiary)
Everything works fine for me.
import pandas as pd
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_tertiary_education_attainment")
table = pd.DataFrame(df[1])
print(table)
print(table.columns)
Output:
Country Age 25–64 (%) Age Year Non-OECD
Country Age 25–64 (%) 25–34 (%) 35–44 (%) 45–54 (%) 55–64 (%) Year Non-OECD
0 Australia 42 48 46 38 33 2014 NaN
1 Austria 30 38 33 27 21 2014 NaN
2 Belgium 37 44 42 34 26 2014 NaN
3 Brazil 14 15 14 14 11 2013 NaN
4 Canada 54 58 61 51 45 2014 NaN
5 Chile 21 27 24 17 14 2013 NaN
6 China 17 27 15 7 2 2018 NaN
7 Colombia 22 28 23 18 16 2014 NaN
8 Costa Rica 18 21 19 17 17 2014 NaN
9 Czech Republic 22 30 21 20 15 2014 NaN
10 Denmark 36 42 41 33 29 2014 NaN
11 Estonia 38 40 39 35 36 2014 NaN
12 Finland 42 40 50 44 34 2014 NaN
13 France 32 44 39 26 20 2013 NaN
14 Germany 27 28 29 26 25 2014 NaN
15 Greece 28 39 27 26 21 2014 NaN
16 Hungary 23 32 25 20 17 2014 NaN
17 Iceland 37 41 42 36 29 2014 NaN
18 Indonesia 8 10 9 8 4 2011 NaN
19 Ireland 41 51 49 34 24 2014 NaN
20 Israel 49 46 53 48 47 2014 NaN
21 Italy 17 24 19 13 12 2014 NaN
22 Japan 48 59 53 47 35 2014 NaN
23 Latvia 30 39 31 27 23 2014 NaN
24 Lithuania 37 53 38 30 28 2014 NaN
25 Luxembourg 46 53 56 40 32 2014 NaN
26 Mexico 19 25 17 16 13 2014 NaN
27 Netherlands 34 44 38 30 27 2014 NaN
28 New Zealand 36 40 41 32 29 2014 NaN
29 Norway 42 49 49 36 32 2014 NaN
30 Poland 27 43 32 18 14 2014 NaN
31 Portugal 22 31 26 17 13 2014 NaN
32 Russia 54 58 55 53 50 2013 NaN
33 Saudi Arabia 22 26 22 18 14 2013 NaN
34 Slovakia 20 30 21 15 14 2014 NaN
35 Slovenia 29 38 35 24 18 2014 NaN
36 South Africa 7 5 7 8 7 2012 NaN
37 South Korea 45 68 56 33 17 2014 NaN
38 Spain 35 41 43 30 21 2014 NaN
39 Sweden 39 46 46 32 30 2014 NaN
40 Switzerland 40 46 45 38 31 2014 NaN
41 Turkey 17 25 16 10 10 2014 NaN
42 Taiwan[3] 45 X X X X 2015 NaN
43 United Kingdom 42 49 46 38 35 2014 NaN
44 United States 44 46 47 43 41 2014 NaN
__
MultiIndex([( 'Country', 'Country'),
('Age 25–64 (%)', 'Age 25–64 (%)'),
( 'Age', '25–34 (%)'),
( 'Age', '35–44 (%)'),
( 'Age', '45–54 (%)'),
( 'Age', '55–64 (%)'),
( 'Year', 'Year'),
( 'Non-OECD', 'Non-OECD')],
)
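The printed columns are a two-level MultiIndex, which is a likely reason the commented-out lines in the question fail: a label such as "Age25-64(%)" does not exist, and drop(..., inplace=True) returns None, so assigning its result replaces the table with None. A hedged sketch of one way around both issues follows. It uses a small hand-built frame with the same column structure instead of the live Wikipedia page, so the two rows and the subset of columns are illustrative only.

```python
import pandas as pd

# Hand-built stand-in for the scraped table (same two-level header shape)
columns = pd.MultiIndex.from_tuples([
    ("Country", "Country"),
    ("Age 25–64 (%)", "Age 25–64 (%)"),
    ("Age", "25–34 (%)"),
    ("Age", "35–44 (%)"),
    ("Year", "Year"),
    ("Non-OECD", "Non-OECD"),
])
table = pd.DataFrame(
    [["Australia", 42, "48", "46", 2014, None],
     ["Taiwan[3]", 45, "X", "X", 2015, None]],
    columns=columns,
)

# Keep only the second header level so columns have simple string labels
table.columns = table.columns.get_level_values(1)

# drop returns a new DataFrame; do not combine assignment with inplace=True
table = table.drop(columns=["Non-OECD"])

# Convert the age columns to numbers; non-numeric cells such as "X" become NaN
age_cols = ["25–34 (%)", "35–44 (%)"]
table[age_cols] = table[age_cols].apply(pd.to_numeric, errors="coerce")

print(table.dtypes)
```

Note that the labels contain an en dash (–) rather than a hyphen, so they are easiest to copy from print(table.columns) instead of retyping.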

Subtract value of column based on another column

I have a big dataframe (the following is an example)
country value
portugal 86
germany 20
belgium 21
Uk 81
portugal 77
UK 87
I want to subtract 60 from the value whenever the country is portugal or UK, regardless of letter case. The dataframe should then look like this (Python):
country value
portugal 26
germany 20
belgium 21
Uk 21
portugal 17
UK 27
IIUC, use isin on the lowercased country string to check whether each value is in a reference list, then select those rows with loc to modify them in place:
df.loc[df['country'].str.lower().isin(['portugal', 'uk']), 'value'] -= 60
output:
country value
0 portugal 26
1 germany 20
2 belgium 21
3 Uk 21
4 portugal 17
5 UK 27
Use numpy.where:
In [1621]: import numpy as np
In [1622]: df['value'] = np.where(df['country'].str.lower().isin(['portugal', 'uk']), df['value'] - 60, df['value'])
In [1623]: df
Out[1623]:
country value
0 portugal 26
1 germany 20
2 belgium 21
3 Uk 21
4 portugal 17
5 UK 27
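For a version that can be run as is, the sketch below rebuilds the example frame and stores the boolean mask in a variable so it can be inspected before anything is changed. The names targets and mask are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["portugal", "germany", "belgium", "Uk", "portugal", "UK"],
    "value": [86, 20, 21, 81, 77, 87],
})

# Countries to adjust, written in lowercase to match the lowercased column
targets = {"portugal", "uk"}

# True for every row whose country matches, ignoring letter case
mask = df["country"].str.lower().isin(targets)
print(mask.sum(), "rows will be changed")

# Subtract 60 only on the selected rows
df.loc[mask, "value"] -= 60
print(df)
```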

Importing Excel data with merging cells

How can we import Excel data with merged cells?
Please find the Excel sheet image.
The last column has 3 sub-columns. How can we import it without making changes to the Excel sheet?
You could try this
# Store data in variable
dataset = 'Merged_Column_Data.xlsx'
# Import dataset and skip row 1
df = pd.read_excel(dataset,skiprows=1)
Unnamed: 0 Unnamed: 1 Unnamed: 2 Gold Silver Bronze
0 Great Britain GBR 2012 29 17 19
1 China CHN 2012 38 28 22
2 Russia RUS 2012 24 25 32
3 United States US 2012 46 28 29
4 Korea KOR 2012 13 8 7
# Create dictionary to handle unnamed columns
col_dict = {'Unnamed: 0':'Country', 'Unnamed: 1':'Country',
'Unnamed: 2':'Year',}
# Rename columns with dictionary
df.rename(columns=col_dict)
Country Country Year Gold Silver Bronze
0 Great Britain GBR 2012 29 17 19
1 China CHN 2012 38 28 22
2 Russia RUS 2012 24 25 32
3 United States US 2012 46 28 29
4 Korea KOR 2012 13 8 7

How to take mean across row in Pandas pivot table Dataframe? [duplicate]

I have a pandas dataframe as seen below which is a pivot table. I would like to print Africa in 2007 as well as do the mean of the entire Americas row; any ideas how to do this? I have been doing combinations of stack/unstack for a while now to no avail.
year 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
continent
Africa 12 13 15 20 39 25 81 12 22 23 25 44
Americas 12 14 65 10 119 15 21 42 47 84 15 89
Asia 12 13 89 20 39 25 81 29 77 23 25 89
Europe 12 13 15 20 39 25 81 29 23 32 15 89
Oceania 12 13 15 20 39 25 81 27 32 85 25 89
import pandas as pd
df = pd.read_csv('dummy_data.csv')
# handy to see the continent name against the value rather than '0' or '3'
df.set_index('continent', inplace=True)
# print mean for all rows - see how the continent name helps here
print(df.mean(axis=1))
print('---')
print()
# print the mean for just the 'Americas' row
print(df.mean(axis=1)['Americas'])
print('---')
print()
# print the value of the 'Africa' row for the year (column) 2007
print(df.query('continent == "Africa"')['2007'])
print('---')
print()
Output:
continent
Africa 27.583333
Americas 44.416667
Asia 43.500000
Europe 32.750000
Oceania 38.583333
dtype: float64
---
44.416666666666664
---
continent
Africa 44
Name: 2007, dtype: int64
---
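The same two lookups can also be written with loc directly on the pivot table. The sketch below builds two rows of the table in memory and assumes the year columns are integers. After read_csv, as in the answer above, they would be strings such as '2007', so the label used would need to match.

```python
import pandas as pd

years = [1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007]

# Two rows of the question's pivot table, with continent as the index
df = pd.DataFrame(
    [[12, 13, 15, 20, 39, 25, 81, 12, 22, 23, 25, 44],
     [12, 14, 65, 10, 119, 15, 21, 42, 47, 84, 15, 89]],
    index=pd.Index(["Africa", "Americas"], name="continent"),
    columns=pd.Index(years, name="year"),
)

# Single cell: row label first, then column label
africa_2007 = df.loc["Africa", 2007]

# Mean over all years for one row
americas_mean = df.loc["Americas"].mean()

print(africa_2007)
print(americas_mean)
```

Selecting the row first and then calling mean avoids computing the mean of every continent when only one is needed.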
