Python - Creating a custom dataframe by transposing an existing one

I have a dataframe that consists of the following entries:
The dataset contains information about each country, for about 10 years, across 5 indicators, as represented above.
I am trying to turn the entries of the Indicator column into separate columns for all 5 indicators and, similarly, merge the individual year columns into a single Year column.
Ideally the final output should look like:
So, the country column should have extra entries according to the number of years, and the values should be transposed according to each indicator.
I tried using pandas' built-in functions such as melt and pivot but am not getting anywhere.
Any guidance on this would be appreciated. Thanks!

You can use stack and unstack:
out = (df.set_index(['Country Name', 'Indicator'])
         .rename_axis(columns='Year').stack('Year')
         .unstack('Indicator').rename_axis(columns=None)
         .reset_index())
print(out)
# Output
Country Name Year I J K L
0 A 2008 0.002535 0.966967 0.033397 0.487713
1 A 2009 0.797714 0.642878 0.752803 0.527796
2 A 2010 0.773789 0.288100 0.013059 0.918957
3 B 2008 0.979707 0.968245 0.820731 0.309862
4 B 2009 0.086082 0.608920 0.629591 0.156926
5 B 2010 0.983092 0.536192 0.380157 0.091473
6 C 2008 0.834870 0.145200 0.225985 0.686520
7 C 2009 0.771646 0.834432 0.519951 0.651756
8 C 2010 0.003791 0.292212 0.257748 0.473694
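One thing to note: the Year values come from the original column labels, so they end up as strings after the reshape. If numeric years are needed, a cast afterwards does it (a minimal sketch, not part of the original answer):
out['Year'] = out['Year'].astype(int)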
Alternative with melt and pivot:
out = (df.melt(['Country Name', 'Indicator'], var_name='Year')
         .pivot(index=['Country Name', 'Year'], columns='Indicator', values='value')
         .rename_axis(columns=None).reset_index())
Input data:
>>> df
Country Name Indicator 2008 2009 2010
0 A I 0.002535 0.797714 0.773789
1 A J 0.966967 0.642878 0.288100
2 A K 0.033397 0.752803 0.013059
3 A L 0.487713 0.527796 0.918957
4 B I 0.979707 0.086082 0.983092
5 B J 0.968245 0.608920 0.536192
6 B K 0.820731 0.629591 0.380157
7 B L 0.309862 0.156926 0.091473
8 C I 0.834870 0.771646 0.003791
9 C J 0.145200 0.834432 0.292212
10 C K 0.225985 0.519951 0.257748
11 C L 0.686520 0.651756 0.473694
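For anyone who wants to reproduce this, the sample input could be built roughly like so (a sketch; the floats in the answer were random, so the exact values will differ):
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # hypothetical seed
df = pd.DataFrame({
    'Country Name': np.repeat(list('ABC'), 4),   # A, A, A, A, B, ...
    'Indicator': list('IJKL') * 3,               # I, J, K, L per country
    '2008': rng.random(12),
    '2009': rng.random(12),
    '2010': rng.random(12),
})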

Related

Pandas - Unstack/pivot a dataframe with pandas

I have a dataframe that looks like this:
Column A  Column B  Category
       1         7         A
       2         8         A
       3         9         B
       4        10         B
       5        11         C
       6        12         C
I would like to write code to produce the following dataframe:
Category A            Category B            Category C
Column A    Column B  Column A    Column B  Column A    Column B
       1           7         3           9         5          11
       2           8         4          10         6          12
I've tried pd.pivot_table, but am not able to figure it out. Can someone help me with this please? Thanks!
You can create a dummy index to use pivot table with:
out = df.pivot_table(
    columns="Category",
    index=df.groupby("Category").cumcount()
)
which has output:
         Column A        Column B
Category        A  B  C         A   B   C
0               1  3  5         7   9  11
1               2  4  6         8  10  12
I don't know if there's any simple way to rearrange the columns to be in your format within pivot_table itself. Here is a way by doing some post processing:
final = out.swaplevel(axis=1).sort_index(axis=1, level=0)
final:
Category         A                    B                    C
          Column A  Column B   Column A  Column B   Column A  Column B
0                1         7          3         9          5        11
1                2         8          4        10          6        12
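The "Category" text at the left of the header is just the leftover column level name from the pivot; if it should not show up in the display, it can be cleared (a small sketch, not part of the original answer):
final.columns = final.columns.set_names([None, None])  # drop both column level names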
The issue is that you cannot identify each row uniquely to be able to apply pivot. To this end, create a "within-group" index as follows.
from io import StringIO
import pandas as pd
# setup sample data
data = StringIO("""
Column A;Column B;Category
1;7;A
2;8;A
3;9;B
4;10;B
5;11;C
6;12;C
"""
)
df = pd.read_csv(data, sep=";")
# assign a within-group index
df['id'] = df.groupby('Category').cumcount()
# now apply pivot
df = df.pivot(index='id', columns='Category', values=['Column A', 'Column B'])
Now, you can apply swaplevel and sort_index to match the desired result
df.swaplevel(axis=1).sort_index(axis=1)
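If the top header row should literally read "Category A", "Category B", "Category C" as in the target table, the level labels can be rewritten afterwards (a sketch, not part of the original answer):
out = df.swaplevel(axis=1).sort_index(axis=1)
# prefix the top column level with the word "Category"
out.columns = out.columns.set_levels(
    ['Category ' + c for c in out.columns.levels[0]], level=0)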

Create a pivot table with yearly index for each value

I'm currently dealing with a World Bank dataset and I would like to pivot it to analyse it better.
Currently, it's in a form like this:
df = pd.DataFrame({'Country Name' : ['A','A','A','B','B','B','C','C','C'],'Indicator' : ['X','Y','Z','X','Y','Z','X','Y','Z'],'2010' : [1,2,3,4,5,6,7,8,9],'2011' : [9,8,7,6,5,4,3,2,1]})
print(df)
Country Name Indicator 2010 2011
0 A X 1 9
1 A Y 2 8
2 A Z 3 7
3 B X 4 6
4 B Y 5 5
5 B Z 6 4
6 C X 7 3
7 C Y 8 2
8 C Z 9 1
I simplified it; there are more columns with more yearly values, but that's good enough to explain. Now I would like to pivot that table so that I get the indicators as columns, the countries as the index, and the yearly values as the values. For that I would like to have a df like this:
                 year   indicator1   indicator 2   indicator3
Country A        2010   values       .             .
                 2011   .            .             .
Country B        2010   .            .             .
                 2011   .            .             .
Country C        2010   .            .             .
But since the values are stored in the year columns, I don't know how to transform it into this layout.
I tried doing it like this:
indicators = data.pivot_table(index='Country Name',columns='Indicator Name', values='2011', dropna=False)
This of course works, but I just get the values for the year 2011, and I don't really want to create a dataframe for each year.
But I can't add the names of the year columns to the index, as that creates an enormous dataframe for some reason, and if I just add them to the values it only creates more columns, with some kind of column index to separate the years.
I don't know if there's a method to do what I'm looking for, but I appreciate any help!
Thanks
melt before pivot_table:
(df.melt(['Country Name', 'Indicator'], var_name='Year')
   .pivot_table(index=['Country Name', 'Year'],
                columns='Indicator', values='value')
)
Or reshape with stack/unstack (requires unique values, unlike pivot_table):
(df.set_index(['Country Name', 'Indicator'])
.rename_axis(columns='Year').stack().unstack('Indicator')
)
output:
Indicator X Y Z
Country Name Year
A 2010 1 2 3
2011 9 8 7
B 2010 4 5 6
2011 6 5 4
C 2010 7 8 9
2011 3 2 1
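To get a flat frame back, with Country Name and Year as ordinary columns (as in the first question above), a reset_index can follow either approach (a sketch):
out = (df.melt(['Country Name', 'Indicator'], var_name='Year')
         .pivot_table(index=['Country Name', 'Year'],
                      columns='Indicator', values='value')
         .rename_axis(columns=None)
         .reset_index())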

Creating row numbers based on a column value

I have the following dataset as input in my jupyter notebook:
Product Year Variable
A 2018 2
A 2019 4
B 2018 2
B 2019 3
I'm wondering what would be the quickest way to create a loop or something of that sort for my data set, such that I get the following output:
Product Year Variable Row_Num
A 2018 2 1
A 2018 2 2
A 2019 4 1
A 2019 4 2
A 2019 4 3
A 2019 4 4
B 2018 2 1
B 2018 2 2
and so on...
TL;DR - Based on a value in a particular column, I would like to create rows. E.g., if the value is 3, I would like to create 3 copies of that row with a column that has the values 1, 2, 3 against them.
One of the ways I think I found is to first create duplicates based on my variable and then use a function similar to rank() or row_number() to create my "row_num" column. It would be helpful if anyone could share other possible ways to do the same. 😄
If I understand correctly, you'd like to create n duplicates of each row, where the value of n is given in one of the columns. Here's a way to do that:
df["new_id"] = df.Variable.apply(lambda x: list(range(x)))
df = df.explode("new_id")
Output:
Product Year Variable new_id
0 A 2018 2 0
0 A 2018 2 1
1 A 2019 4 0
1 A 2019 4 1
1 A 2019 4 2
1 A 2019 4 3
2 B 2018 2 0
2 B 2018 2 1
3 B 2019 3 0
3 B 2019 3 1
3 B 2019 3 2
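The asker's expected output starts the counter at 1 and names the column Row_Num; a small variation on the same idea, starting again from the original frame, gets exactly that (a sketch, assuming pandas >= 1.1 for ignore_index):
df["Row_Num"] = df["Variable"].apply(lambda n: list(range(1, n + 1)))
out = df.explode("Row_Num", ignore_index=True)  # resets the index to 0..N-1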
Solution for Pandas <= 0.24
If for whatever reason, explode is not available because you're using an older version of pandas, you can do the following:
cols = df.columns

def make_df(r):
    d = {k: r[k] for k in cols}
    d["new_var"] = range(r["Variable"])
    res = pd.DataFrame(d)
    return res

dfs = []
for row in df.iterrows():
    dfs.append(make_df(row[1]))
pd.concat(dfs)
The output is essentially identical (the counter column here is named new_var and the index restarts within each group).
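Another option that also works on older pandas is to repeat the rows via the index and then number them with cumcount; this is not from the original answer, just an alternative sketch starting from the original frame:
# repeat each row "Variable" times, then number the copies 1..n per original row
out = df.loc[df.index.repeat(df["Variable"])].copy()
out["Row_Num"] = out.groupby(level=0).cumcount() + 1
out = out.reset_index(drop=True)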

Index duplicate rows in Python DataFrame

I am trying to add a column to index duplicate rows and order by another column.
Here's the example dataset:
df = pd.DataFrame({'Name': ['A','A','A','B','B','B','B'],
                   'Score': [9,10,10,8,7,8,8],
                   'Year': [2019,2018,2017,2019,2018,2017,2016]})
I want to use ['Name', 'Score'] for identifying duplicates, then index the duplicates in order of Year to get the following result:
Here rows 2 and 3 are duplicates because they have the same name and score, so I order them by year and assign an index.
Does anyone have a good idea how to do this in Python? Thank you so much!
You are looking for cumcount:
df['Index'] = (df.sort_values('Year', ascending=False)
                 .groupby(['Name','Score'])
                 .cumcount() + 1
               )
Output:
Name Score Year Index
0 A 9 2019 1
1 A 10 2018 1
2 A 10 2017 2
3 B 8 2019 1
4 B 7 2018 1
5 B 8 2017 2
6 B 8 2016 3

Sum duplicated rows on a multi-index pandas dataframe

Hello, I'm having trouble dealing with pandas. I'm trying to sum duplicated rows in a multi-index DataFrame.
I tried df.groupby(level=[0,1]).sum(), and also df.stack().reset_index().groupby(['year', 'product']).sum() and some others, but I cannot get it to work.
I'd also like to add every unique product for each given year and give them a 0 value if they weren't listed.
Example: dataframe with multi-index and 3 different products (A,B,C):
volume1 volume2
year product
2010 A 10 12
A 7 3
B 7 7
2011 A 10 10
B 7 6
C 5 5
Expected output: if there are duplicated products for a given year, we sum them.
If one of the products isn't listed for a year, we create a new row full of 0s.
volume1 volume2
year product
2010 A 17 15
B 7 7
C 0 0
2011 A 10 10
B 7 6
C 5 5
Any idea? Thanks
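For anyone wanting to reproduce this, the example frame can be built roughly like so (a sketch inferred from the table above):
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(2010, 'A'), (2010, 'A'), (2010, 'B'),
     (2011, 'A'), (2011, 'B'), (2011, 'C')],
    names=['year', 'product'])
df = pd.DataFrame({'volume1': [10, 7, 7, 10, 7, 5],
                   'volume2': [12, 3, 7, 10, 6, 5]}, index=idx)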
You can make the second level of the index a CategoricalIndex and when you use groupby it will include all of the categories.
# on recent pandas set_levels returns a new index (the old inplace=True form is gone)
df.index = df.index.set_levels(pd.CategoricalIndex(df.index.levels[1]), level=1)
# observed=False keeps the unobserved categories so missing products show up
df.groupby(level=[0, 1], observed=False).sum().fillna(0, downcast='infer')
volume1 volume2
year product
2010 A 17 15
B 7 7
C 0 0
2011 A 10 10
B 7 6
C 5 5
Use sum with unstack and stack:
df = df.sum(level=[0,1]).unstack(fill_value=0).stack()
#same as
#df = df.groupby(level=[0,1]).sum().unstack(fill_value=0).stack()
Alternative with reindex:
df = df.sum(level=[0,1])
#same as
#df = df.groupby(level=[0,1]).sum()
mux = pd.MultiIndex.from_product(df.index.levels, names = df.index.names)
df = df.reindex(mux, fill_value=0)
Alternative1, thanks #Wen:
df = df.sum(level=[0,1]).unstack().stack(dropna=False)
print (df)
volume1 volume2
year product
2010 A 17 15
B 7 7
C 0 0
2011 A 10 10
B 7 6
C 5 5
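Note that on recent pandas versions the level argument of sum has been removed, so the groupby spelling from the comments above is the one to reach for (a sketch):
out = (df.groupby(level=[0, 1]).sum()
         .unstack(fill_value=0)
         .stack())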
