I'm currently working with a World Bank dataset and would like to pivot it to make it easier to analyse.
Currently, it's in a form like this:
df = pd.DataFrame({'Country Name' : ['A','A','A','B','B','B','C','C','C'],'Indicator' : ['X','Y','Z','X','Y','Z','X','Y','Z'],'2010' : [1,2,3,4,5,6,7,8,9],'2011' : [9,8,7,6,5,4,3,2,1]})
print(df)
Country Name Indicator 2010 2011
0 A X 1 9
1 A Y 2 8
2 A Z 3 7
3 B X 4 6
4 B Y 5 5
5 B Z 6 4
6 C X 7 3
7 C Y 8 2
8 C Z 9 1
I simplified it; the real data has more columns with more yearly values, but this is enough to explain. Now I would like to pivot that table so that the indicators become columns, with the countries in the index and the yearly values as the values. For that, I would like a df like this:
           year  indicator1  indicator2  indicator3
Country A  2010  values      .           .
           2011  .           .           .
Country B  2010  .           .           .
           2011  .           .           .
Country C  2010  .           .           .
But since the values are stored in the year columns, I don't know how to transform the data into this layout.
I tried doing it like this:
indicators = data.pivot_table(index='Country Name',columns='Indicator Name', values='2011', dropna=False)
This of course works, but I only get the values for the year 2011, and I don't really want to create a dataframe for each year.
But I can't add the names of the year columns to the index, as that creates an enormous dataframe for some reason, and if I just add them to the values it only creates more columns, with an extra level on the columns to separate the years.
I don't know if there's a method to do what I'm looking for, but I appreciate any help!
Thanks
melt before pivot_table:
(df.melt(['Country Name', 'Indicator'], var_name='Year')
   .pivot_table(index=['Country Name', 'Year'],
                columns='Indicator', values='value')
)
Or reshape with stack/unstack (requires unique values, unlike pivot_table):
(df.set_index(['Country Name', 'Indicator'])
   .rename_axis(columns='Year').stack().unstack('Indicator')
)
output:
Indicator          X  Y  Z
Country Name Year
A            2010  1  2  3
             2011  9  8  7
B            2010  4  5  6
             2011  6  5  4
C            2010  7  8  9
             2011  3  2  1
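The remark that stack/unstack "requires unique values, unlike pivot_table" can be illustrated with a small sketch. The data below is hypothetical (it is not the asker's dataset) and deliberately repeats the pair ('A', 'X'); the names long_df, wide and unstack_failed are introduced only for this example.

```python
import pandas as pd

# Hypothetical sample in which the pair ('A', 'X') appears twice.
df = pd.DataFrame({
    'Country Name': ['A', 'A', 'A'],
    'Indicator': ['X', 'X', 'Y'],
    '2010': [1, 3, 5],
    '2011': [9, 7, 5],
})

# Long format: one row per (country, indicator, year).
long_df = df.melt(id_vars=['Country Name', 'Indicator'], var_name='Year')

# pivot_table aggregates the duplicates (its default aggfunc is the mean).
wide = long_df.pivot_table(index=['Country Name', 'Year'],
                           columns='Indicator', values='value')

# unstack refuses to reshape when the index contains duplicate entries.
try:
    (long_df.set_index(['Country Name', 'Year', 'Indicator'])['value']
            .unstack('Indicator'))
    unstack_failed = False
except ValueError:
    unstack_failed = True
```

If duplicates are possible in the real data, pivot_table silently averages them, so it may be worth checking for them first or passing an explicit aggfunc.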
Related
I have a dataframe that consists of the following entries:
The dataset contains information about each country for about 10 years, for 5 indicators, as represented above.
I am trying to convert the Indicator column into rows for all 5 indicators and similarly convert the year columns into one merged column.
Ideally the final output should look like:
So, the country column should have extra entries according to the number of years, and the values should be transposed according to each indicator.
I tried using pandas built-in functions such as melt and pivot but am not getting anywhere.
Any guidance on this would be appreciated. Thanks!
You can use stack and unstack:
out = (df.set_index(['Country Name', 'Indicator'])
         .rename_axis(columns='Year').stack('Year')
         .unstack('Indicator').rename_axis(columns=None)
         .reset_index())
print(out)
# Output
Country Name Year I J K L
0 A 2008 0.002535 0.966967 0.033397 0.487713
1 A 2009 0.797714 0.642878 0.752803 0.527796
2 A 2010 0.773789 0.288100 0.013059 0.918957
3 B 2008 0.979707 0.968245 0.820731 0.309862
4 B 2009 0.086082 0.608920 0.629591 0.156926
5 B 2010 0.983092 0.536192 0.380157 0.091473
6 C 2008 0.834870 0.145200 0.225985 0.686520
7 C 2009 0.771646 0.834432 0.519951 0.651756
8 C 2010 0.003791 0.292212 0.257748 0.473694
Alternative with melt and pivot:
# pivot takes keyword-only arguments in pandas 2.0+
out = (df.melt(['Country Name', 'Indicator'], var_name='Year')
         .pivot(index=['Country Name', 'Year'], columns='Indicator', values='value')
         .rename_axis(columns=None).reset_index())
Input data:
>>> df
Country Name Indicator 2008 2009 2010
0 A I 0.002535 0.797714 0.773789
1 A J 0.966967 0.642878 0.288100
2 A K 0.033397 0.752803 0.013059
3 A L 0.487713 0.527796 0.918957
4 B I 0.979707 0.086082 0.983092
5 B J 0.968245 0.608920 0.536192
6 B K 0.820731 0.629591 0.380157
7 B L 0.309862 0.156926 0.091473
8 C I 0.834870 0.771646 0.003791
9 C J 0.145200 0.834432 0.292212
10 C K 0.225985 0.519951 0.257748
11 C L 0.686520 0.651756 0.473694
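As a quick sanity check, the two approaches (stack/unstack and melt/pivot) can be compared on a small frame. This is only a sketch: the integer data and the names via_stack, via_pivot, key and same are made up for illustration.

```python
import pandas as pd

# Small hypothetical frame with the same shape as the input data.
df = pd.DataFrame({
    'Country Name': ['A', 'A', 'B', 'B'],
    'Indicator': ['I', 'J', 'I', 'J'],
    '2008': [1, 2, 3, 4],
    '2009': [5, 6, 7, 8],
})

# Approach 1: move the year columns into the index, then spread the indicators.
via_stack = (df.set_index(['Country Name', 'Indicator'])
               .rename_axis(columns='Year').stack()
               .unstack('Indicator').rename_axis(columns=None)
               .reset_index())

# Approach 2: melt to long format, then pivot the indicators into columns.
via_pivot = (df.melt(id_vars=['Country Name', 'Indicator'], var_name='Year')
               .pivot(index=['Country Name', 'Year'],
                      columns='Indicator', values='value')
               .rename_axis(columns=None).reset_index())

# Compare row by row after sorting on the key columns.
key = ['Country Name', 'Year']
same = (via_stack.sort_values(key).to_dict('records')
        == via_pivot.sort_values(key).to_dict('records'))
```

Note that in both results the Year column holds strings, because the years come from column labels; converting with astype(int) is an option if numeric years are needed.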
I have the following dataset as input in my jupyter notebook:
Product Year Variable
A 2018 2
A 2019 4
B 2018 2
B 2019 3
I'm wondering what would be the quickest way to write a loop, or something similar, over my dataset so that I get the following output:
Product Year Variable Row_Num
A 2018 2 1
A 2018 2 2
A 2019 4 1
A 2019 4 2
A 2019 4 3
A 2019 4 4
B 2018 2 1
B 2018 2 2
and so on...
TL;DR: Based on the value in a particular column, I would like to create rows. E.g., if the value is 3, I would like to create 3 copies of that row, with a column that holds the values 1, 2, 3.
One of the ways I think I found is to first create duplicates based on my variable and then use a function similar to rank() or row_number() to create my "row_num" column. It would be helpful if anyone can share other possible ways to do the same. 😄
If I understand correctly, you'd like to create n duplicates of each row, where the value of n is given in one of the columns. Here's a way to do that:
df["new_id"] = df.Variable.apply(lambda x: list(range(x)))
df = df.explode("new_id")
Output:
Product Year Variable new_id
0 A 2018 2 0
0 A 2018 2 1
1 A 2019 4 0
1 A 2019 4 1
1 A 2019 4 2
1 A 2019 4 3
2 B 2018 2 0
2 B 2018 2 1
3 B 2019 3 0
3 B 2019 3 1
3 B 2019 3 2
Solution for Pandas <= 0.24
If for whatever reason, explode is not available because you're using an older version of pandas, you can do the following:
cols = df.columns

def make_df(r):
    # scalar values are broadcast to the length of the range
    d = {k: r[k] for k in cols}
    d["new_id"] = range(r["Variable"])
    res = pd.DataFrame(d)
    return res

dfs = []
for row in df.iterrows():
    # iterrows yields (index, row) pairs
    dfs.append(make_df(row[1]))
pd.concat(dfs)
The output is identical.
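The question also mentions duplicating the rows first and then numbering them, with numbering starting at 1. A minimal sketch of that idea, assuming the original index is unique, uses Index.repeat and groupby(...).cumcount(); the column name Row_Num is taken from the question and out is a name chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'B'],
    'Year': [2018, 2019, 2018, 2019],
    'Variable': [2, 4, 2, 3],
})

# Repeat each index label as many times as its Variable value.
out = df.loc[df.index.repeat(df['Variable'])].copy()

# Number the copies of each original row, starting at 1.
out['Row_Num'] = out.groupby(level=0).cumcount() + 1

# Drop the repeated index labels.
out = out.reset_index(drop=True)
```

This avoids building a Python list per row, which may matter on larger frames.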
I am trying to add a column to index duplicate rows and order by another column.
Here's the example dataset:
df = pd.DataFrame({'Name': ['A','A','A','B','B','B','B'], 'Score': [9,10,10,8,7,8,8], 'Year': [2019,2018,2017,2019,2018,2017,2016]})
I want to use ['Name', 'Score'] for identifying duplicates. Then index the duplicate order by Year to get following result:
Here rows 2 and 3 are duplicates because they have the same name and score, so I order them by year and number them.
Does anyone have a good idea of how to do this in Python? Thank you so much!
You are looking for cumcount:
df['Index'] = (df.sort_values('Year', ascending=False)
                 .groupby(['Name','Score'])
                 .cumcount() + 1
              )
Output:
Name Score Year Index
0 A 9 2019 1
1 A 10 2018 1
2 A 10 2017 2
3 B 8 2019 1
4 B 7 2018 1
5 B 8 2017 2
6 B 8 2016 3
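For comparison, a similar numbering can be obtained without sorting first, by ranking Year within each group. This is a sketch of an alternative, not the method from the answer above; by_cumcount and by_rank are names chosen here.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Score': [9, 10, 10, 8, 7, 8, 8],
    'Year': [2019, 2018, 2017, 2019, 2018, 2017, 2016],
})

# The cumcount approach: sort by Year, then count within each group.
by_cumcount = (df.sort_values('Year', ascending=False)
                 .groupby(['Name', 'Score'])
                 .cumcount() + 1)

# Alternative: rank Year within each (Name, Score) group, newest first.
by_rank = (df.groupby(['Name', 'Score'])['Year']
             .rank(method='first', ascending=False)
             .astype(int))
```

Both results carry the original row labels, so either can be assigned back with df['Index'] = ... and pandas aligns the values on the index.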
Here is test data
import numpy as np
import pandas as pd
import datetime
# multi-indexed dataframe via cartesian join
df1 = pd.DataFrame([1, 2, 3])
df2 = pd.DataFrame(pd.date_range(start='2016', end='2018', freq='M'))
df1['key'] = 0
df2['key'] = 0
df = df1.merge(df2, how='outer', on='key')
del df1, df2
del df['key']
df.columns = ['id','date']
df['value'] = pd.DataFrame(np.random.randn(len(df)))
df.set_index(['date', 'id'], inplace=True)
df.sort_index(inplace=True)
df.head()
output:
value
date id
2016-01-31 1 0.245029
2 -2.141292
3 1.521566
2016-02-29 1 0.870639
2 1.407977
There is probably a better way to generate the cartesian join, but I'm new and that is the best I could find to generate panel data that looks like mine. Anyway, my goal is to create a quick table looking at the pattern of observations to see if any are missing as it relates to time.
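On the side question of generating the cartesian product, one possible sketch uses pd.MultiIndex.from_product instead of a merge on a dummy key. The month-end dates are built from month starts plus pd.offsets.MonthEnd(0) so that the snippet does not depend on which month-end frequency alias the installed pandas version accepts; the seed and the names dates, index and rng are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Month-end dates for 2016 and 2017: month starts rolled forward to month end.
dates = (pd.date_range('2016-01-01', '2017-12-01', freq='MS')
         + pd.offsets.MonthEnd(0))

# Every combination of date and id, with the same level order as above.
index = pd.MultiIndex.from_product([dates, [1, 2, 3]], names=['date', 'id'])

# Random values, seeded so that the example is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame({'value': rng.standard_normal(len(index))}, index=index)
```

The result is already indexed by date and id, so no set_index call is needed.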
My goal is to create a year by month table of frequency observations. This is close to what I want:
df.groupby(pd.Grouper(level='date',freq='M')).count()
But it gives a vertical list. My data is much bigger than this small MWE so I'd like to fit it more compactly, as well as see if there are seasonal patterns (i.e. lots of observations in December or June).
It seems to me that this should work but it doesn't:
df.groupby([df.index.levels[0].month, df.index.levels[0].year]).count()
I get a ValueError: Grouper and axis must be same length error.
This gives what I'm looking for but it seems to me that it should be easier with the time index:
df.reset_index(inplace=True)
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df.groupby(['month', 'year'])['value'].count().unstack().T
output:
month 1 2 3 4 5 6 7 8 9 10 11 12
year
2016 3 3 3 3 3 3 3 3 3 3 3 3
2017 3 3 3 3 3 3 3 3 3 3 3 3
Also, since this is just a quick validation, I'd rather not reset the index, then re-establish the index (and delete month and year) each time just to see this table.
I think you need Index.get_level_values to select the first level of the MultiIndex:
idx = df.index.get_level_values(0)
df1 = df.groupby([idx.year, idx.month])['value'].count().unstack()
Or:
df1 = df.groupby([idx.year, idx.month]).size().unstack()
The difference between count and size is that count omits NaNs and size does not.
print (df1)
date 1 2 3 4 5 6 7 8 9 10 11 12
date
2016 3 3 3 3 3 3 3 3 3 3 3 3
2017 3 3 3 3 3 3 3 3 3 3 3 3
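The difference between count and size only shows up when values are missing. A small sketch with one NaN, on hypothetical data; the year and month labels are added with Index.rename so that the two grouping keys have distinct names.

```python
import numpy as np
import pandas as pd

# Two dates, two ids, and one missing value in January.
dates = pd.to_datetime(['2016-01-31', '2016-01-31', '2016-02-29', '2016-02-29'])
index = pd.MultiIndex.from_arrays([dates, [1, 2, 1, 2]], names=['date', 'id'])
df = pd.DataFrame({'value': [0.5, np.nan, 1.5, 2.5]}, index=index)

# Group on the year and month of the 'date' level without resetting the index.
idx = df.index.get_level_values('date')
keys = [idx.year.rename('year'), idx.month.rename('month')]

counts = df.groupby(keys)['value'].count().unstack()  # ignores the NaN
sizes = df.groupby(keys).size().unstack()             # counts every row
```

Here counts reports 1 observation for January 2016 while sizes reports 2, so for spotting missing observations count is probably the one to look at.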
Hello, I'm having trouble with pandas. I'm trying to sum duplicated rows in a multi-index DataFrame.
I tried df.groupby(level=[0,1]).sum(), and also df.stack().reset_index().groupby(['year', 'product']).sum() and some others, but I cannot get it to work.
I'd also like to add every unique product for each given year and give them a 0 value if they weren't listed.
Example: dataframe with multi-index and 3 different products (A,B,C):
volume1 volume2
year product
2010 A 10 12
A 7 3
B 7 7
2011 A 10 10
B 7 6
C 5 5
Expected output: if there are duplicated products for a given year, we sum them.
If one of the products isn't listed for a year, we create a new row full of 0.
volume1 volume2
year product
2010 A 17 15
B 7 7
C 0 0
2011 A 10 10
B 7 6
C 5 5
Any idea ? Thanks
You can make the second level of the index a CategoricalIndex and when you use groupby it will include all of the categories.
# set_levels no longer accepts inplace=True in pandas 2.0+, so assign the result back
df.index = df.index.set_levels(pd.CategoricalIndex(df.index.levels[1]), level=1)
df.groupby(level=[0, 1], observed=False).sum().fillna(0).astype(int)
volume1 volume2
year product
2010 A 17 15
B 7 7
C 0 0
2011 A 10 10
B 7 6
C 5 5
Use groupby and sum with unstack and stack:
df = df.groupby(level=[0,1]).sum().unstack(fill_value=0).stack()
#in pandas < 2.0 this could also be written as
#df = df.sum(level=[0,1]).unstack(fill_value=0).stack()
Alternative with reindex:
df = df.groupby(level=[0,1]).sum()
#in pandas < 2.0 this could also be written as
#df = df.sum(level=[0,1])
mux = pd.MultiIndex.from_product(df.index.levels, names = df.index.names)
df = df.reindex(mux, fill_value=0)
Alternative1, thanks @Wen (the missing rows come back as NaN here, so add .fillna(0) if zeros are needed):
df = df.groupby(level=[0,1]).sum().unstack().stack(dropna=False)
print (df)
volume1 volume2
year product
2010 A 17 15
B 7 7
C 0 0
2011 A 10 10
B 7 6
C 5 5
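For reference, the reindex variant can be run end to end on the example from the question. This is a sketch that rebuilds the sample frame by hand; summed, full and out are names introduced only for this example.

```python
import pandas as pd

# Rebuild the sample frame from the question.
index = pd.MultiIndex.from_tuples(
    [(2010, 'A'), (2010, 'A'), (2010, 'B'),
     (2011, 'A'), (2011, 'B'), (2011, 'C')],
    names=['year', 'product'],
)
df = pd.DataFrame({'volume1': [10, 7, 7, 10, 7, 5],
                   'volume2': [12, 3, 7, 10, 6, 5]}, index=index)

# Step 1: sum the duplicated (year, product) rows.
summed = df.groupby(level=['year', 'product']).sum()

# Step 2: build every (year, product) combination and fill the gaps with 0.
full = pd.MultiIndex.from_product(summed.index.levels,
                                  names=summed.index.names)
out = summed.reindex(full, fill_value=0)
```

Because fill_value=0 is applied during the reindex, the columns stay integer and no NaN is introduced along the way.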