Pandas: pivoting rows to columns with columns as column-row - python

I have a data frame that looks like this
df = pd.DataFrame({'A': [1,2,3], 'B': [11,12,13]})
df
A B
0 1 11
1 2 12
2 3 13
I would like to create the following data frame where the columns are a combination of each column-row
A0 A1 A2 B0 B1 B2
0 1 2 3 11 12 13
It seems that the pivot and transpose functions will switch columns and rows but I actually want to flatten the data frame to a single row. How can I achieve this?

IIUC
s=df.stack().sort_index(level=1).to_frame(0).T
s.columns=s.columns.map('{0[1]}{0[0]}'.format)
s
A0 A1 A2 B0 B1 B2
0 1 2 3 11 12 13
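A self-contained variant of the same idea, as a minimal sketch: it swaps stack plus sort_index for DataFrame.unstack, which on a frame with a flat index returns a Series keyed by (column, row). The names flat and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [11, 12, 13]})

# On a frame with a flat index, unstack() returns a Series indexed by (column, row)
flat = df.unstack()
# Join each (column, row) pair into a single label such as 'A0'
flat.index = [f'{col}{row}' for col, row in flat.index]
# Turn the Series into a one-row DataFrame
result = flat.to_frame().T
print(result)
```

Because the Series is already ordered column by column, no extra sort step is needed here.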

One option, with pivot_wider:
# pip install pyjanitor
import janitor
import pandas as pd
df.index = [0] * len(df)
df = df.assign(num=range(len(df)))
df.pivot_wider(names_from="num", names_sep = "")
A0 A1 A2 B0 B1 B2
0 1 2 3 11 12 13

Related

How to convert rows to columns in pandas?

I want to convert every three rows of a DataFrame into columns.
Input:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,11,12,13],'b':['a','b','c','aa','bb','cc']})
print(df)
Output:
a b
0 1 a
1 2 b
2 3 c
3 11 aa
4 12 bb
5 13 cc
Expected:
a1 a2 a3 b1 b2 b3
0 1 2 3 a b c
1 11 12 13 aa bb cc
Use set_index with floor division and modulo by 3, then unstack and flatten the resulting MultiIndex columns:
import numpy as np
a = np.arange(len(df))
# if the frame has a default RangeIndex, this also works:
# a = df.index
df1 = df.set_index([a // 3, a % 3]).unstack()
# Python 3.6+ solution
df1.columns = [f'{i}{j + 1}' for i, j in df1.columns]
# Python below 3.6
# df1.columns = ['{}{}'.format(i, j + 1) for i, j in df1.columns]
print(df1)
a1 a2 a3 b1 b2 b3
0 1 2 3 a b c
1 11 12 13 aa bb cc
I'm adding a different approach with group -> apply.
df is first grouped by df.index//3 and then the munge function is applied to each group.
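def munge(group):
    # transpose, then stack into a Series indexed by (column, row label)
    g = group.T.stack()
    # number positions within each column (1..len(group)), not across the whole Series
    g.index = ['{}{}'.format(c, i % len(group) + 1) for i, (c, _) in enumerate(g.index)]
    return g
result = df.groupby(df.index//3).apply(munge)
Output:
>>> df.groupby(df.index//3).apply(munge)
a1 a2 a3 b1 b2 b3
0 1 2 3 a b c
1 11 12 13 aa bb cc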

transpose dataframe changing columns names

Could you please help me in transforming the dataframe df
df=pd.DataFrame(data=[['a1',2,3],['b1',5,6],['c1',8,9]],columns=['A','B','C'])
df
Out[37]:
A B C
0 a1 2 3
1 b1 5 6
2 c1 8 9
into df2
df2=pd.DataFrame(data=[[2,5,8],[3,6,9]],columns=['a1','b1','c1'])
df2
Out[36]:
a1 b1 c1
0 2 5 8
1 3 6 9
The first column should become the column names
and then I should transpose the elements...is there a pythonic way?
A little trick with slicing, initialise a new DataFrame.
pd.DataFrame(df.values.T[1:], columns=df.A.tolist())
Or,
pd.DataFrame(df.values[:, 1:].T, columns=df.A.tolist())
a1 b1 c1
0 2 5 8
1 3 6 9
For general solution use set_index with transpose:
df1 = df.set_index('A').T.reset_index(drop=True).rename_axis(None, axis=1)
Or remove column A, transpose and build new DataFrame by constructor:
df1 = pd.DataFrame(df.drop(columns='A').T.values, columns=df['A'].values)
print (df1)
a1 b1 c1
0 2 5 8
1 3 6 9

Pandas Multiindex Groupby aggregate column with value from another column

I have a pandas dataframe with multiindex where I want to aggregate the duplicate key rows as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame({'S':[0,5,0,5,0,3,5,0],'Q':[6,4,10,6,2,5,17,4],'A':
['A1','A1','A1','A1','A2','A2','A2','A2'],
'B':['B1','B1','B2','B2','B1','B1','B1','B2']})
df.set_index(['A','B'])
        Q  S
A  B
A1 B1   6  0
   B1   4  5
   B2  10  0
   B2   6  5
A2 B1   2  0
   B1   5  3
   B1  17  5
   B2   4  0
and I would like to groupby this dataframe to aggregate the Q values (sum) and keep the S value that corresponds to the maximal row of the Q value yielding this:
df2 = pd.DataFrame({'S':[0,0,5,0],'Q':[10,16,24,4],'A':
['A1','A1','A2','A2'],
'B':['B1','B2','B1','B2']})
df2.set_index(['A','B'])
        Q  S
A  B
A1 B1  10  0
   B2  16  0
A2 B1  24  5
   B2   4  0
I tried the following, but it didn't work:
df.groupby(by=['A','B']).agg({'Q':'sum','S':df.S[df.Q.idxmax()]})
any hints?
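One way is to use agg, apply, and join (this assumes df has already been indexed with df = df.set_index(['A','B']), so that both pieces share the same index):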
g = df.groupby(['A','B'], group_keys=False)
g.apply(lambda x: x.loc[x.Q == x.Q.max(),['S']]).join(g.agg({'Q':'sum'}))
Output:
       S   Q
A  B
A1 B1  0  10
   B2  0  16
A2 B1  5  24
   B2  0   4
Here's one way
In [1800]: def agg(x):
      ...:     m = x.S.iloc[np.argmax(x.Q.values)]
      ...:     return pd.Series({'Q': x.Q.sum(), 'S': m})
      ...:
In [1801]: df.groupby(['A', 'B']).apply(agg)
Out[1801]:
        Q  S
A  B
A1 B1  10  0
   B2  16  0
A2 B1  24  5
   B2   4  0
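Another possible route, not taken from the answers above, is to combine a named aggregation with a per-group idxmax. This is only a sketch on the question's sample data; the name out is illustrative, and it assumes the row labels of df are unique so that idxmax can be fed back into df.loc.

```python
import pandas as pd

df = pd.DataFrame({
    'S': [0, 5, 0, 5, 0, 3, 5, 0],
    'Q': [6, 4, 10, 6, 2, 5, 17, 4],
    'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
    'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2'],
})

g = df.groupby(['A', 'B'])
# Sum of Q per (A, B) group
out = g.agg(Q=('Q', 'sum'))
# idxmax gives, per group, the row label holding the largest Q;
# look up S on exactly those rows (both results are ordered by group key)
out['S'] = df.loc[g['Q'].idxmax(), 'S'].to_numpy()
print(out)
```

The attempt in the question fails because the value given for 'S' in the agg dict is a single scalar computed once on the whole frame, not a function evaluated per group.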

Python pandas pivot multiindex

Input
I have the following file input.txt:
D E F G H
a 1 b 1 4
a 1 c 1 5
b 2 c 2 6
Desired output
How can I create a new data frame, that uses columns D and E as an index? I want a triangular matrix that looks something like this:
    a1  b1  c1  b2  c2
a1   0   4   5   0   0
b1       0   0   0   0
c1           0   0   0
b2               0   6
c2                   0
1st attempt
I am importing the data frame and I am trying to do a pivot like this:
import pandas as pd
df1 = pd.read_csv(
'input.txt', index_col=[0,1], delim_whitespace=True,
usecols=['D','E','F','G','H'])
df2 = df1.pivot(index=['D', 'E'], columns=['F','G'], values='H')
df1 looks like this:
     F  G  H
D E
a 1  b  1  4
  1  c  1  5
b 2  c  2  6
df1.index looks like this:
MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1], [0, 0, 1]],
names=['D', 'E'])
df2 fails to be generated and I get this error message:
`KeyError: "['D' 'E'] not in index"`
2nd attempt
I thought I had solved it like this:
import pandas as pd
df = pd.read_csv(
'input.txt', delim_whitespace=True,
usecols=['D','E','F','G','H'],
dtype={'D':str, 'E':str, 'F':str, 'G':str, 'H':float},
)
pivot = pd.pivot_table(df, values='H', index=['D','E'], columns=['F','G'])
pivot looks like this:
F      b    c
G      1    1    2
D E
a 1    4    5  NaN
b 2  NaN  NaN    6
But when I try to do this to convert it to a symmetric matrix:
pivot.add(df.T, fill_value=0).fillna(0)
Then I get this error:
ValueError: cannot join with no level specified and no overlapping names
3rd attempt and solution
I found a solution here. It is also what @Moritz suggested, but I'm new to pandas and didn't understand his comment. I did this:
import pandas as pd
df1 = pd.read_csv(
    'input.txt', delim_whitespace=True,
    usecols=['D','E','F','G','H'],
    dtype={'D':str, 'E':str, 'F':str, 'G':str, 'H':float}
)
# D and E must stay ordinary columns (no index_col), otherwise df1['D'] raises a KeyError
df1['DE'] = df1['D']+df1['E']
df1['FG'] = df1['F']+df1['G']
df2 = df1.pivot(index='DE', columns='FG', values='H')
This data frame is generated:
FG   b1   c1   c2
DE
a1    4    5  NaN
b2  NaN  NaN    6
Afterwards I do df3 = df2.add(df2.T, fill_value=0).fillna(0) to convert the triangular matrix to a symmetric matrix. Is generating new columns really the easiest way to accomplish what I want? My reason for doing all of this is that I want to generate a heat map with matplotlib and hence need the data to be in matrix form. The final matrix/dataframe looks like this:
a1 b1 b2 c1 c2
a1 0 4 0 5 0
b1 4 0 0 0 0
b2 0 0 0 0 6
c1 5 0 0 0 0
c2 0 0 6 0 0
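The pivot-then-symmetrize steps described above can be sketched end to end as follows. The inline frame is a stand-in for input.txt (an assumption made so the example runs without the file); everything else follows the code in the question.

```python
import pandas as pd

# Inline copy of the rows in input.txt, so the sketch runs without the file
df1 = pd.DataFrame({
    'D': ['a', 'a', 'b'], 'E': ['1', '1', '2'],
    'F': ['b', 'c', 'c'], 'G': ['1', '1', '2'],
    'H': [4.0, 5.0, 6.0],
})
# Combined labels such as 'a1' and 'b1'
df1['DE'] = df1['D'] + df1['E']
df1['FG'] = df1['F'] + df1['G']
df2 = df1.pivot(index='DE', columns='FG', values='H')

# Adding the transpose aligns both frames on the union of their labels;
# cells missing on both sides stay NaN until fillna(0)
df3 = df2.add(df2.T, fill_value=0).fillna(0)
print(df3)
```

The resulting square frame can then be handed to a matplotlib heat map routine such as imshow via df3.values.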

Flatten DataFrame with multi-index columns

I'd like to convert a Pandas DataFrame that is derived from a pivot table into a row representation as shown below.
This is where I'm at:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'goods': ['a', 'a', 'b', 'b', 'b'],
'stock': [5, 10, 30, 40, 10],
'category': ['c1', 'c2', 'c1', 'c2', 'c1'],
'date': pd.to_datetime(['2014-01-01', '2014-02-01', '2014-01-06', '2014-02-09', '2014-03-09'])
})
# we don't care about year in this example
df['month'] = df['date'].map(lambda x: x.month)
piv = df.pivot_table(["stock"], "month", ["goods", "category"], aggfunc="sum")
piv = piv.reindex(np.arange(piv.index[0], piv.index[-1] + 1))
piv = piv.ffill(axis=0)
piv = piv.fillna(0)
print(piv)
which results in
         stock
goods        a       b
category    c1  c2  c1  c2
month
1            5   0  30   0
2            5  10  30  40
3            5  10  10  40
And this is where I want to get to.
goods category month stock
a c1 1 5
a c1 2 0
a c1 3 0
a c2 1 0
a c2 2 10
a c2 3 0
b c1 1 30
b c1 2 0
b c1 3 10
b c2 1 0
b c2 2 40
b c2 3 0
Previously, I used
piv = piv.stack()
piv = piv.reset_index()
print(piv)
to get rid of the multi-indexes, but this results in this because I pivot now on two columns (["goods", "category"]):
       month category  stock
goods                      a   b
0          1       c1      5  30
1          1       c2      0   0
2          2       c1      5  30
3          2       c2     10  40
4          3       c1      5  10
5          3       c2     10  40
Does anyone know how I can get rid of the multi-index in the column and get the result into a DataFrame of the exemplified format?
>>> piv.unstack().reset_index().drop('level_0', axis=1)
goods category month 0
0 a c1 1 5
1 a c1 2 5
2 a c1 3 5
3 a c2 1 0
4 a c2 2 10
5 a c2 3 10
6 b c1 1 30
7 b c1 2 30
8 b c1 3 10
9 b c2 1 0
10 b c2 2 40
11 b c2 3 40
Then all you need to do is rename the last column from 0 to stock.
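Putting the question's setup and this answer together, including the rename, might look like the sketch below. The name flat is illustrative; the level_0 column appears because the top column level of piv is unnamed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'goods': ['a', 'a', 'b', 'b', 'b'],
    'stock': [5, 10, 30, 40, 10],
    'category': ['c1', 'c2', 'c1', 'c2', 'c1'],
    'date': pd.to_datetime(['2014-01-01', '2014-02-01', '2014-01-06',
                            '2014-02-09', '2014-03-09']),
})
df['month'] = df['date'].map(lambda x: x.month)
piv = df.pivot_table(["stock"], "month", ["goods", "category"], aggfunc="sum")
piv = piv.reindex(np.arange(piv.index[0], piv.index[-1] + 1))
piv = piv.ffill(axis=0).fillna(0)

flat = (
    piv.unstack()                   # Series indexed by (level_0, goods, category, month)
       .reset_index()               # every index level becomes a column
       .drop('level_0', axis=1)     # the constant 'stock' level is not needed
       .rename(columns={0: 'stock'})
)
print(flat)
```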
It seems to me that melt (aka unpivot) is very close to what you want to do:
In [11]: pd.melt(piv)
Out[11]:
NaN goods category value
0 stock a c1 5
1 stock a c1 5
2 stock a c1 5
3 stock a c2 0
4 stock a c2 10
5 stock a c2 10
6 stock b c1 30
7 stock b c1 30
8 stock b c1 10
9 stock b c2 0
10 stock b c2 40
11 stock b c2 40
There's a rogue column (stock) that appears here because that column-header level is constant in piv. If we drop that level first, melt works out of the box:
In [12]: piv.columns = piv.columns.droplevel(0)
In [13]: pd.melt(piv)
Out[13]:
goods category value
0 a c1 5
1 a c1 5
2 a c1 5
3 a c2 0
4 a c2 10
5 a c2 10
6 b c1 30
7 b c1 30
8 b c1 10
9 b c2 0
10 b c2 40
11 b c2 40
Edit: The above actually drops the index (month). To keep it, make it a column with reset_index:
In [21]: pd.melt(piv.reset_index(), id_vars=['month'], value_name='stock')
Out[21]:
month goods category stock
0 1 a c1 5
1 2 a c1 5
2 3 a c1 5
3 1 a c2 0
4 2 a c2 10
5 3 a c2 10
6 1 b c1 30
7 2 b c1 30
8 3 b c1 10
9 1 b c2 0
10 2 b c2 40
11 3 b c2 40
I know that the question has already been answered, but for my dataset, which has MultiIndex columns, the provided solution was inefficient. So here I am posting another solution for unpivoting MultiIndex columns using pandas.
Here is the problem I had:
As one can see, the dataframe has three index levels and two levels of MultiIndex columns.
The desired dataframe format was:
When I tried the options given above, pd.melt did not allow more than one column in the var_name argument. Therefore, every time I tried a melt, I ended up losing some attribute from my table.
The solution I found was to stack the dataframe twice.
Before the code, it is worth noting that the desired name for the value column of my unpivoted table was "Populacao residente em domicilios particulares ocupados" (see the code below). All value entries should therefore be stacked into this newly created column.
Here is a code snippet:
import pandas as pd
# reading my table
# (sep, encoding and squeeze are not read_excel parameters in current pandas, so they are omitted)
df = pd.read_excel(r'my_table.xls', header=[2,3],
                   index_col=[0,1,2], na_values=['-', ' ', '*']).fillna(0)
df.index.names = ['COD_MUNIC_7', 'NOME_MUN', 'TIPO']
df.columns.names = ['sexo', 'faixa_etaria']
df.head()
# stacking twice: first the 'sexo' column level, then the remaining 'faixa_etaria' level
df = pd.DataFrame(pd.Series(df.stack(level=0).stack(), name='Populacao residente em domicilios particulares ocupados')).reset_index()
df.head()
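Since the Excel file is not available, the double stack can be tried on a small synthetic frame. Everything below except the level names is made up for illustration: the municipality codes, town names, age bands, counts, and the short column name populacao are all hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the Excel table: 3 index levels, 2 column levels
index = pd.MultiIndex.from_tuples(
    [(1, 'Town A', 'urban'), (2, 'Town B', 'rural')],
    names=['COD_MUNIC_7', 'NOME_MUN', 'TIPO'])
columns = pd.MultiIndex.from_product(
    [['Homens', 'Mulheres'], ['0-4', '5-9']],
    names=['sexo', 'faixa_etaria'])
wide = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], index=index, columns=columns)

tidy = (
    wide.stack(level=0)        # move 'sexo' from the columns into the index
        .stack()               # move 'faixa_etaria' as well, leaving a Series
        .rename('populacao')   # name of the single value column
        .reset_index()         # all five index levels become ordinary columns
)
print(tidy)
```

Each original cell ends up as one row, so the two-row, four-column frame yields eight rows with both column levels preserved as attributes.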
Another solution I found was to stack the dataframe first and then apply melt.
Here is the alternative code:
df = df.stack('faixa_etaria').reset_index().melt(id_vars=['COD_MUNIC_7', 'NOME_MUN','TIPO', 'faixa_etaria'],
value_vars=['Homens', 'Mulheres'],
value_name='Populacao residente em domicilios particulares ocupados',
var_name='sexo')
df.head()
Sincerely yours,
Philipe Riskalla Leal
