I have a dataframe that looks like this:
Column A    Column B    Category
1           7           A
2           8           A
3           9           B
4           10          B
5           11          C
6           12          C
I would like to write code to produce the following dataframe:
Category A             Category B             Category C
Column A    Column B   Column A    Column B   Column A    Column B
1           7          3           9          5           11
2           8          4           10         6           12
I've tried pd.pivot_table, but am not able to figure it out. Can someone help me with this please? Thanks!
You can create a dummy index to use with pivot_table:
out = df.pivot_table(
    columns="Category",
    index=df.groupby("Category").cumcount()
)
which has output:
         Column A        Column B
Category        A  B  C         A   B   C
0               1  3  5         7   9  11
1               2  4  6         8  10  12
I don't know if there's a simple way to get your column order within pivot_table itself. Here is a way with some post-processing:
final = out.swaplevel(axis=1).sort_index(axis=1, level=0)
final:
Category        A                 B                 C
         Column A Column B Column A Column B Column A Column B
0               1        7        3        9        5       11
1               2        8        4       10        6       12
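For convenience, the two steps can be chained into one expression (same operations as above):

out = (df.pivot_table(columns="Category",
                      index=df.groupby("Category").cumcount())
         .swaplevel(axis=1)
         .sort_index(axis=1, level=0))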
The issue is that pivot needs each row to be uniquely identifiable. To this end, create a "within-group" index as follows.
from io import StringIO
import pandas as pd
# setup sample data
data = StringIO("""
Column A;Column B;Category
1;7;A
2;8;A
3;9;B
4;10;B
5;11;C
6;12;C
"""
)
df = pd.read_csv(data, sep=";")
# assign a within-group index
df['id'] = df.groupby('Category').cumcount()
# now apply pivot
df = df.pivot(index='id', columns='Category', values=['Column A', 'Column B'])
Now you can apply swaplevel and sort_index to match the desired result:
df.swaplevel(axis=1).sort_index(axis=1)
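For reference, this should reproduce the target layout:

Category        A                 B                 C
         Column A Column B Column A Column B Column A Column B
id
0               1        7        3        9        5       11
1               2        8        4       10        6       12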
I've had issues finding a concise way to append a series to each row of a dataframe, with the series labels becoming new columns in the df. All the values will be the same on each of the dataframe's rows, which is desired.
I can get the effect by doing the following:
df["new_col_A"] = ser["new_col_A"]
.....
df["new_col_Z"] = ser["new_col_Z"]
But this is so tedious there must be a better way, right?
Given:
# df
   A  B
0  1  2
1  1  3
2  4  6

# ser
C    a
D    b
dtype: object
Doing:
df[ser.index] = ser
print(df)
Output:
   A  B  C  D
0  1  2  a  b
1  1  3  a  b
2  4  6  a  b
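A minimal runnable version of the same idea (the df and ser here are stand-ins for your own data):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 4], 'B': [2, 3, 6]})
ser = pd.Series({'C': 'a', 'D': 'b'})

# The series labels become new columns; each scalar value is
# broadcast down the whole column.
df[ser.index] = ser
print(df)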
A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number, since I do not know the numbers. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with a column range. For example, if you have this dataframe:
   A  B  C  D  E
0  1  3  3  6  0
1  2  2  4  9  1
2  3  1  5  8  4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
   A  E
0  1  0
1  2  1
2  3  4
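A positional alternative is to resolve the two boundary labels to integer locations first; a small sketch, assuming the column labels are unique:

start = df.columns.get_loc("B")
stop = df.columns.get_loc("D")
df = df.drop(columns=df.columns[start:stop + 1])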
Is there a way to pass column index positions as the arguments, rather than column names?
Every example that I see is written with column names in value_vars. I need to use the column index.
For instance, instead of:
df2 = pd.melt(df,value_vars=['asset1','asset2'])
Using something similar to:
df2 = pd.melt(df,value_vars=[0,1])
Select the column names by positional indexing:
df = pd.DataFrame({
    'asset1': list('acacac'),
    'asset2': [4]*6,
    'A': [7,8,9,4,2,3],
    'D': [1,3,5,7,1,0],
    'E': [5,3,6,9,2,4]
})

df2 = pd.melt(df,
              id_vars=df.columns[[0,1]],
              value_vars=df.columns[[2,3]],
              var_name='c_name',
              value_name='Value')
print(df2)
   asset1  asset2 c_name  Value
0       a       4      A      7
1       c       4      A      8
2       a       4      A      9
3       c       4      A      4
4       a       4      A      2
5       c       4      A      3
6       a       4      D      1
7       c       4      D      3
8       a       4      D      5
9       c       4      D      7
10      a       4      D      1
11      c       4      D      0
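If you only need positional value_vars, as in the question, the same trick reduces to:

df2 = pd.melt(df, value_vars=df.columns[[0, 1]])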
I have two pandas DFs of unequal sizes. For example:
Df1

id  value
a       2
b       3
c      22
d       5

Df2

id  value
c      22
a       2
Now I want to extract from DF1 those rows which have the same id as in DF2. My first approach is to run 2 for loops, with something like:
x = []
for i in range(len(DF2)):
    for j in range(len(DF1)):
        if DF2['id'][i] == DF1['id'][j]:
            x.append(DF1.iloc[j])
This works, but for files of 400,000 lines in one and 5,000 in the other, I need an efficient Pythonic+pandas way.
import pandas as pd

data1 = {'id': ['a','b','c','d'],
         'value': [2,3,22,5]}
data2 = {'id': ['c','a'],
         'value': [22,2]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

finaldf = pd.concat([df1, df2], ignore_index=True)
Output after concat
  id  value
0  a      2
1  b      3
2  c     22
3  d      5
4  c     22
5  a      2
Final output: keep only the rows that occur in both frames (the fully duplicated ones), then drop the repeats:

finaldf[finaldf.duplicated(keep=False)].drop_duplicates()

  id  value
0  a      2
2  c     22
You can concat the dataframes, mark the rows whose id is duplicated, then drop_duplicates to keep just the first occurrence:
m = pd.concat((df1, df2))
m[m.duplicated('id', keep=False)].drop_duplicates()

  id  value
0  a      2
2  c     22
You can try this:
df = df1[df1.set_index(['id']).index.isin(df2.set_index(['id']).index)]
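Since 'id' is a regular column in both frames, the same filter can also be written more directly; a sketch, assuming you only need to match on 'id' (and, for the merge form, that df2's ids are unique):

out = df1[df1['id'].isin(df2['id'])]

# or with an inner merge, keeping df1's values
out = df1.merge(df2[['id']], on='id')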
First, I show the pandas dataframe to elucidate my problem.
import pandas as pd
mi = pd.MultiIndex.from_product([["A","B"],["c","d"]], names=['lv1', 'lv2'])
df1 = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]],columns=mi)
This Python code creates a dataframe (df1) like this:
# input dataframe
lv1  A       B
lv2  c   d   c   d
0    1   2   3   4
1    5   6   7   8
2    9  10  11  12
I want to create 'c*d' columns on lv2 from df1's data, like this:
# output dataframe after calculation
lv1  A            B
lv2  c   d  c*d   c   d  c*d
0    1   2    2   3   4   12
1    5   6   30   7   8   56
2    9  10   90  11  12  132
For this problem, I wrote some code like this:

for l1 in mi.levels[0]:
    df1.loc[:, (l1, "c*d")] = df1.loc[:, (l1, "c")] * df1.loc[:, (l1, "d")]
df1.sort_index(axis=1, inplace=True)
Although this code almost solves my problem, I really want to write it without a 'for' statement, like this:
df1.loc[:,(slice(None),"c*d")]=df1.loc[:,(slice(None),"c")]*df1.loc[:,(slice(None),"d")]
With this statement, I get a KeyError saying 'c*d' is missing.
Is there syntactic sugar for this calculation? Or can I achieve better performance with other code?
A slightly improved version of your solution:
for l1 in mi.levels[0]:
    df1.loc[:, (l1, "c*d")] = df1.loc[:, (l1, "c")] * df1.loc[:, (l1, "d")]

mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c','d','c*d']])
df1 = df1.reindex(columns=mux)
print(df1)
   A            B
   c   d  c*d   c   d  c*d
0  1   2    2   3   4   12
1  5   6   30   7   8   56
2  9  10   90  11  12  132
Another solution with stack and unstack:
mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c','d','c_d']])
df1 = (df1.stack(0)
          .assign(c_d=lambda x: x.prod(axis=1))
          .unstack()
          .swaplevel(0, 1, axis=1)
          .reindex(columns=mux))
print(df1)

   A            B
   c   d  c_d   c   d  c_d
0  1   2    2   3   4   12
1  5   6   30   7   8   56
2  9  10   90  11  12  132
df2 = df1.xs("c", axis=1, level=1).mul(df1.xs("d", axis=1, level=1))
df2.columns = pd.MultiIndex.from_product([df2.columns, ['c*d']])
print (df2)
     A    B
   c*d  c*d
0    2   12
1   30   56
2   90  132
mux = pd.MultiIndex.from_product([df2.columns.levels[0], ['c','d','c*d']])
df = df1.join(df2).reindex(columns=mux)
print (df)
   A            B
   c   d  c*d   c   d  c*d
0  1   2    2   3   4   12
1  5   6   30   7   8   56
2  9  10   90  11  12  132
Explanation of jezrael's answer using stack, which may be the most idiomatic way in pandas.
output = (df1
# "Stack" data, by moving the top level ('lv1') of the
# column MultiIndex into row index,
# now the rows are a MultiIndex and the columns
# are a regular Index.
.stack(0)
# Since we only have 2 columns now, 'lv2' ('c' & 'd')
# we can multiply them together along the row axis.
# The assign method takes key=value pairs mapping new column
# names to the function used to calculate them. Here we're
# wrapping them in a dictionary and unpacking them using **
.assign(**{'c*d': lambda x: x.product(axis=1)})
# Undoes the stack operation, moving 'lv1' back to the
# column index, but now as the bottom level of the column index
.unstack()
# This sets the order of the column index MultiIndex levels.
# Since they are named we can use the names, you can also use
# their integer positions instead. Here axis=1 references
# the column index
.swaplevel('lv1', 'lv2', axis=1)
# Sort the values in both levels of the column MultiIndex.
# This will order them as c, c*d, d which is not what you
# specified above, however having a sorted MultiIndex is required
# for indexing via .loc[:, (...)] to work properly
.sort_index(axis=1)
)
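If you need the c, d, c*d order from the question rather than the sorted c, c*d, d order, finish with the same reindex used in the answers above (a sketch; output is the chained result from the explanation):

mux = pd.MultiIndex.from_product([output.columns.levels[0], ['c', 'd', 'c*d']],
                                 names=['lv1', 'lv2'])
output = output.reindex(columns=mux)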