Importing Excel data with merged cells - python

How can we import Excel data with merged cells?
Please see the attached Excel sheet image.
The last column has 3 sub-columns. How can we import it without making changes to the Excel sheet?

You could try this
import pandas as pd

# Store the file name in a variable
dataset = 'Merged_Column_Data.xlsx'
# Import the dataset, skipping the first header row
df = pd.read_excel(dataset, skiprows=1)
Unnamed: 0 Unnamed: 1 Unnamed: 2 Gold Silver Bronze
0 Great Britain GBR 2012 29 17 19
1 China CHN 2012 38 28 22
2 Russia RUS 2012 24 25 32
3 United States US 2012 46 28 29
4 Korea KOR 2012 13 8 7
# Create a dictionary to rename the unnamed columns
col_dict = {'Unnamed: 0': 'Country', 'Unnamed: 1': 'Code',
            'Unnamed: 2': 'Year'}
# Rename the columns with the dictionary
df.rename(columns=col_dict)
Country Code Year Gold Silver Bronze
0 Great Britain GBR 2012 29 17 19
1 China CHN 2012 38 28 22
2 Russia RUS 2012 24 25 32
3 United States US 2012 46 28 29
4 Korea KOR 2012 13 8 7
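If the merged headers actually span the first two rows of the sheet, another option is to let read_excel build a MultiIndex header and flatten it yourself. This is only a sketch, since the exact sheet layout is an assumption here:
import pandas as pd

# Assumption: the merged headers occupy the first two rows of the sheet
df = pd.read_excel('Merged_Column_Data.xlsx', header=[0, 1])

# Flatten the two-level columns: prefer the sub-column name (Gold/Silver/Bronze)
# and fall back to the top-level name where pandas filled in 'Unnamed: ...'
df.columns = [
    str(sub) if not str(sub).startswith('Unnamed') else str(top)
    for top, sub in df.columns
]
print(df.head())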

Related

Convert columns into rows data with Pandas

My dataset has some information by location for n dates. The CSV looks like:
Country  year2018       year2019       year2020
         saleA  saleB   saleA  saleB   saleA  saleB
USA      22     23      323    32      31     65
china    12     12      2      66      66     78
I want my data to be of the form
Country year saleA saleB
USA year2018 22 23
USA year2019 323 32
USA year2020 31 65
china year2018 12 12
.
.
.
How can I do it using pandas?
I tried using pd.melt but couldn't figure it out.
You can reshape your dataframe with set_index and stack:
out = (df.set_index('Country')
         .rename_axis(columns=['year', None])
         .stack('year').reset_index())
Country year saleA saleB
0 USA year2018 22 23
1 USA year2019 323 32
2 USA year2020 31 65
3 China year2018 12 12
4 China year2019 2 66
5 China year2020 66 78
Another solution with melt and pivot_table:
out = (df.melt(id_vars='Country', var_name=['year', 'sale'])
         .pivot_table(index=['Country', 'year'], columns='sale', values='value')
         .reset_index())
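For completeness, a minimal end-to-end sketch, assuming the CSV really has two header rows (years on the first, saleA/saleB on the second) with the country labels in the first column; the file name sales.csv is hypothetical:
import pandas as pd

# header=[0, 1] reads both header rows into a MultiIndex on the columns,
# index_col=0 keeps the country labels as the row index
df = pd.read_csv('sales.csv', header=[0, 1], index_col=0)
df.index.name = 'Country'
df.columns.names = ['year', 'sale']

# Move the 'year' column level into the rows, as in the answers above
out = df.stack('year').reset_index()
print(out)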

Subtract value of column based on another column

I have a big dataframe (the following is an example)
country   value
portugal  86
germany   20
belgium   21
Uk        81
portugal  77
UK        87
I want to subtract 60 from the value whenever the country is portugal or UK; the dataframe should look like this:
country   value
portugal  26
germany   20
belgium   21
Uk        21
portugal  17
UK        27
IIUC, use isin on the lowercased country string to check whether the value is in a reference list, then slice the dataframe with loc for in-place modification:
df.loc[df['country'].str.lower().isin(['portugal', 'uk']), 'value'] -= 60
output:
country value
0 portugal 26
1 germany 20
2 belgium 21
3 Uk 21
4 portugal 17
5 UK 27
Use numpy.where:
In [1621]: import numpy as np
In [1622]: df['value'] = np.where(df['country'].str.lower().isin(['portugal', 'uk']), df['value'] - 60, df['value'])
In [1623]: df
Out[1623]:
country value
0 portugal 26
1 germany 20
2 belgium 21
3 Uk 21
4 portugal 17
5 UK 27
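A third option with the same lowercase matching, using Series.mask instead of numpy; this is just a sketch of the same idea:
# Build the membership mask once, then subtract 60 wherever it is True
mask = df['country'].str.lower().isin(['portugal', 'uk'])
df['value'] = df['value'].mask(mask, df['value'] - 60)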

How to take mean across row in Pandas pivot table Dataframe? [duplicate]

This question already has answers here:
Compute row average in pandas
(5 answers)
Closed 2 years ago.
I have a pandas dataframe as seen below which is a pivot table. I would like to print Africa in 2007 as well as do the mean of the entire Americas row; any ideas how to do this? I have been doing combinations of stack/unstack for a while now to no avail.
year 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
continent
Africa 12 13 15 20 39 25 81 12 22 23 25 44
Americas 12 14 65 10 119 15 21 42 47 84 15 89
Asia 12 13 89 20 39 25 81 29 77 23 25 89
Europe 12 13 15 20 39 25 81 29 23 32 15 89
Oceania 12 13 15 20 39 25 81 27 32 85 25 89
import pandas as pd
df = pd.read_csv('dummy_data.csv')
# handy to see the continent name against the value rather than '0' or '3'
df.set_index('continent', inplace=True)
# print mean for all rows - see how the continent name helps here
print(df.mean(axis=1))
print('---')
print()
# print the mean for just the 'Americas' row
print(df.mean(axis=1)['Americas'])
print('---')
print()
# print the value of the 'Africa' row for the year (column) 2007
print(df.query('continent == "Africa"')['2007'])
print('---')
print()
Output:
continent
Africa 27.583333
Americas 44.416667
Asia 43.500000
Europe 32.750000
Oceania 38.583333
dtype: float64
---
44.416666666666664
---
continent
Africa 44
Name: 2007, dtype: int64
---
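Since the DataFrame is indexed by continent, plain .loc lookups do the same job; a short sketch, assuming the year columns are strings because they came from the CSV header:
# Mean of the entire 'Americas' row
print(df.loc['Americas'].mean())

# Africa's value for 2007 (column labels are strings here, not integers)
print(df.loc['Africa', '2007'])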

DataFrame: add same data for different column and merge the whole file

I have a DataFrame which looks like this:
Name Year Jan Feb Mar Apr
Bee 1998 26 23 22 19
Cee 1999 43 23 43 23
I want to change the DataFrame into something like this:
Name Year Mon Val
Bee 1998 1 26
Bee 1998 2 23
Bee 1998 3 22
Bee 1998 4 19
Cee 1999 1 43
Cee 1999 2 23
Cee 1999 3 43
Cee 1999 4 23
How do I achieve this in Python with Pandas or any other library?
First, reshape your DataFrame with pd.DataFrame.melt:
df = df.melt(id_vars=['Name', 'Year'], var_name='Mon', value_name='Value')
...and then convert your Mon values to datetime values, and extract the month number:
df.loc[:, 'Mon'] = pd.to_datetime(df['Mon'], format='%b').dt.month
# Name Year Mon Value
# 0 Bee 1998 1 26
# 1 Cee 1999 1 43
# 2 Bee 1998 2 23
# 3 Cee 1999 2 23
# 4 Bee 1998 3 22
# 5 Cee 1999 3 43
# 6 Bee 1998 4 19
# 7 Cee 1999 4 23
Alternatively, set Name and Year as the index, convert the column labels to month numbers, and stack:
df = df.set_index(['Name', 'Year'])
df.columns = pd.to_datetime(df.columns, format='%b').month
df.stack()
returns
Name Year
Bee 1998 1 26
2 23
3 22
4 19
Cee 1999 1 43
2 23
3 43
4 23
dtype: int64
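If you want the stacked result in the exact Name/Year/Mon/Val layout from the question, a small follow-up sketch (continuing from the stack() result above):
# Name the three index levels and turn the stacked Series into a flat DataFrame
out = (
    df.stack()
      .rename_axis(['Name', 'Year', 'Mon'])
      .reset_index(name='Val')
)
print(out)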

Pandas: transform column's values in independent columns

I have a Pandas DataFrame which looks like the following (df_olympic).
I would like the values of the Type column to be transformed into independent columns (df_olympic_table).
Original dataframe
In [3]: df_olympic
Out[3]:
Country Type Num
0 USA Gold 46
1 USA Silver 37
2 USA Bronze 38
3 GB Gold 27
4 GB Silver 23
5 GB Bronze 17
6 China Gold 26
7 China Silver 18
8 China Bronze 26
9 Russia Gold 19
10 Russia Silver 18
11 Russia Bronze 19
Transformed dataframe
In [5]: df_olympic_table
Out[5]:
Country N_Gold N_Silver N_Bronze
0 USA 46 37 38
1 GB 27 23 17
2 China 26 18 26
3 Russia 19 18 19
What would be the most convenient way to achieve this?
You can use DataFrame.pivot:
df = df.pivot(index='Country', columns='Type', values='Num')
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 19 19 18
USA 38 46 37
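If you also want the exact df_olympic_table layout from the question (an N_ prefix and Country as a regular column), a small follow-up sketch on the pivoted result above:
# Prefix the medal columns, restore Country as a column and drop the 'Type' label
df_olympic_table = df.add_prefix('N_').reset_index()
df_olympic_table.columns.name = None
df_olympic_table = df_olympic_table[['Country', 'N_Gold', 'N_Silver', 'N_Bronze']]
print(df_olympic_table)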
Another solution with DataFrame.set_index and Series.unstack:
df = df.set_index(['Country','Type'])['Num'].unstack()
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 19 19 18
USA 38 46 37
But if you get:
ValueError: Index contains duplicate entries, cannot reshape
you need pivot_table with some aggregate function; by default it is np.mean, but you can use sum, first, etc.
# add a new row with duplicate values in 'Country' and 'Type'
print (df)
Country Type Num
0 USA Gold 46
1 USA Silver 37
2 USA Bronze 38
3 GB Gold 27
4 GB Silver 23
5 GB Bronze 17
6 China Gold 26
7 China Silver 18
8 China Bronze 26
9 Russia Gold 19
10 Russia Silver 18
11 Russia Bronze 20 <- changed value to 20
11 Russia Bronze 100 <- added new row with duplicate 'Country' and 'Type'
import numpy as np

df = df.pivot_table(index='Country', columns='Type', values='Num', aggfunc=np.mean)
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 60 19 18 <- Russia gets (100 + 20) / 2 = 60
USA 38 46 37
Or use groupby with mean aggregation and reshape with unstack:
df = df.groupby(['Country','Type'])['Num'].mean().unstack()
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 60 19 18 <- Russia gets (100 + 20) / 2 = 60
USA 38 46 37
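If you prefer to avoid the numpy import, pivot_table also accepts string aliases for the aggregation; a variant sketch on the same duplicated data:
# 'mean' reproduces the result above; 'sum' would add the duplicates instead
df_mean = df.pivot_table(index='Country', columns='Type', values='Num', aggfunc='mean')
df_sum = df.pivot_table(index='Country', columns='Type', values='Num', aggfunc='sum')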
