Transpose or Pivot multiple columns in Pandas - python

I would like to transpose multiple columns in a dataframe. I have looked through most of the transpose and pivot pandas posts but could not get it to work.
Here is what my dataframe looks like.
df = pd.DataFrame()
df['L0'] = ['fruit', 'fruit', 'fruit', 'fruit', 'fruit', 'fruit', 'vegetable', 'vegetable', 'vegetable', 'vegetable', 'vegetable', 'vegetable']
df['L1'] = ['apple', 'apple', 'apple', 'banana', 'banana', 'banana', 'tomato', 'tomato', 'tomato', 'lettuce', 'lettuce', 'lettuce']
df['Type'] = ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z']
df['A'] = [3, 0, 4, 3, 1, 3, 2, 2, 2, 4, 2, 4]
df['B'] = [3, 1, 0, 4, 1, 4, 4, 4, 2, 1, 2, 1]
df['C'] = [0, 4, 1, 0, 2, 4, 1, 1, 2, 3, 2, 3]
I would like to transpose/pivot columns A, B and C and replace them with values from column "Type". Resulting dataframe should look like this.
df2 = pd.DataFrame()
df2['L0'] = ['fruit', 'fruit', 'fruit', 'fruit', 'fruit', 'fruit', 'vegetable', 'vegetable', 'vegetable', 'vegetable', 'vegetable', 'vegetable']
df2['L1'] = ['apple', 'apple', 'apple', 'banana', 'banana', 'banana', 'tomato', 'tomato', 'tomato', 'lettuce', 'lettuce', 'lettuce']
df2['Type2'] = ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']
df2['X'] = [3, 3, 0, 3, 4, 0, 2, 4, 1, 4, 1, 3]
df2['Y'] = [0, 1, 4, 1, 1, 2, 2, 4, 1, 2, 2, 2]
df2['Z'] = [4, 0, 1, 3, 4, 4, 2, 2, 2, 4, 1, 3]
The best I could do was this
df.groupby(['L0', 'L1', 'Type'])[['A', 'B', 'C']].sum().unstack('Type')
But this is not really what I want. Thank you!

Add stack before unstack:
df = (df.groupby(['L0', 'L1', 'Type'])[['A', 'B', 'C']]
        .sum()
        .stack()
        .unstack('Type')
        .reset_index()
        .rename_axis(None, axis=1)
        .rename(columns={'level_2': 'Type2'}))
print (df)
L0 L1 Type2 X Y Z
0 fruit apple A 3 0 4
1 fruit apple B 3 1 0
2 fruit apple C 0 4 1
3 fruit banana A 3 1 3
4 fruit banana B 4 1 4
5 fruit banana C 0 2 4
6 vegetable lettuce A 4 2 4
7 vegetable lettuce B 1 2 1
8 vegetable lettuce C 3 2 3
9 vegetable tomato A 2 2 2
10 vegetable tomato B 4 4 2
11 vegetable tomato C 1 1 2
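As an alternative sketch (not part of the original answer), the same reshape can be written with melt followed by pivot_table. The intermediate names long and out, and the column name value, are chosen here for illustration only.

```python
import pandas as pd

# Same data as in the question, written compactly.
df = pd.DataFrame({
    'L0': ['fruit'] * 6 + ['vegetable'] * 6,
    'L1': ['apple'] * 3 + ['banana'] * 3 + ['tomato'] * 3 + ['lettuce'] * 3,
    'Type': ['X', 'Y', 'Z'] * 4,
    'A': [3, 0, 4, 3, 1, 3, 2, 2, 2, 4, 2, 4],
    'B': [3, 1, 0, 4, 1, 4, 4, 4, 2, 1, 2, 1],
    'C': [0, 4, 1, 0, 2, 4, 1, 1, 2, 3, 2, 3],
})

# Step 1: move A, B, C from columns into rows (one row per L0/L1/Type/Type2).
long = df.melt(id_vars=['L0', 'L1', 'Type'], var_name='Type2', value_name='value')

# Step 2: spread the values of "Type" back out into columns X, Y, Z.
out = (long.pivot_table(index=['L0', 'L1', 'Type2'], columns='Type',
                        values='value', aggfunc='sum')
           .reset_index()
           .rename_axis(None, axis=1))
print(out)
```

Like the groupby version, pivot_table sorts the row keys, so lettuce comes before tomato in the result.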

Related

Aggregating results based on three variables

I have a dataframe as shown below
import pandas as pd
data = {
'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
'date': ['2021-03-15', '2021-03-15', '2021-03-17', '2021-03-17', '2021-03-12', '2021-03-12', '2021-12-14', '2021-04-07', '2021-07-09', '2021-04-25', '2021-04-25'],
'n': [1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2],
'type': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'A'],
't': [1.41, 1.05, 2.01, 0.79, 1.37, 2.19, 1.28, 1.9, 0.97, 1.48, 1.96],
'leq': [73.95284344, 75.08732477, 42.52073186, 14.16069694, 59.36296547, 48.7827182, 44.48691532, 63.63032644, 95.20787662, 61.38061937, 12.50041565]
}
df = pd.DataFrame(data)
and would like to aggregate the values based on three variables id, date and type using the formula below
In other words, the aggregation will encompass the three variables
Thanks in advance!
Seems like a direct application of groupby and your mathematical formula:
import numpy as np
df.groupby(['id', 'date', 'type'])\
.apply(lambda s: 10 * np.log(1/(s['t'].sum()) * np.sum(s['t'] * (10**(s['leq']/10)))))
id date type
1 2021-03-15 A 171.482002
2021-03-17 B 94.598488
2 2021-03-12 B 128.447851
2021-12-14 B 102.434908
3 2021-04-07 B 146.514241
2021-04-25 A 28.783271
B 141.334099
2021-07-09 A 219.224237
dtype: float64
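Note that np.log is the natural logarithm. The formula itself is not reproduced in the question, so the intended base is unknown; if it is the usual decibel-style average, which is only an assumption here, np.log10 would be the matching function. The sketch below computes both variants side by side. The helper name combine is hypothetical.

```python
import numpy as np
import pandas as pd

data = {
    'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'date': ['2021-03-15', '2021-03-15', '2021-03-17', '2021-03-17', '2021-03-12',
             '2021-03-12', '2021-12-14', '2021-04-07', '2021-07-09', '2021-04-25',
             '2021-04-25'],
    'n': [1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2],
    'type': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'A'],
    't': [1.41, 1.05, 2.01, 0.79, 1.37, 2.19, 1.28, 1.9, 0.97, 1.48, 1.96],
    'leq': [73.95284344, 75.08732477, 42.52073186, 14.16069694, 59.36296547,
            48.7827182, 44.48691532, 63.63032644, 95.20787662, 61.38061937,
            12.50041565],
}
df = pd.DataFrame(data)

def combine(group, log):
    # Time-weighted mean of 10**(leq/10), then 10 * log(...) with the given log function.
    weighted = (group['t'] * 10 ** (group['leq'] / 10)).sum() / group['t'].sum()
    return 10 * log(weighted)

# Selecting only the needed columns keeps the grouping keys out of each group.
grouped = df.groupby(['id', 'date', 'type'])[['t', 'leq']]
natural = grouped.apply(lambda g: combine(g, np.log))    # same as the answer above
base10 = grouped.apply(lambda g: combine(g, np.log10))   # assumption: decibel-style formula
print(pd.DataFrame({'ln': natural, 'log10': base10}))
```

The two results differ only by the constant factor ln(10), so the choice of base does not change the grouping logic.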

Create dataframe from values/columns from another dataframe [duplicate]

This question already has answers here:
How do I melt a pandas dataframe?
(3 answers)
Closed 12 months ago.
I have a dataframe like this:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6], 'D': ['e', 'f', 'g'], 'E': [7, 8, 9], 'id': [25, 15, 30]})
I would like to use the values of df1 (and their respective columns) as a basis for filling in df2.
Expected:
expected = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E'], 'value': [1, 'a', 4, 'e', 7, 2, 'b', 5, 'f', 8], 'id': [25, 25, 25, 25, 25, 15, 15, 15, 15, 15]})
I tried using iterrows, but as I need to use it for a large amount of data, the performance results were not positive. Can you help me?
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6], 'D': ['e', 'f', 'g'], 'E': [7, 8, 9], 'id': [25, 15, 30]})
pd.melt(df1, id_vars=['id'], var_name = 'column')
id column value
0 25 A 1
1 15 A 2
2 30 A 3
3 25 B a
4 15 B b
5 30 B c
6 25 C 4
7 15 C 5
8 30 C 6
9 25 D e
10 15 D f
11 30 D g
12 25 E 7
13 15 E 8
14 30 E 9
Have you tried DataFrame.melt? I guess something like this could do the trick:
df1.melt(ignore_index=False).merge(
df1, left_index=True, right_index=True
)[['variable', 'value', 'id']].reset_index()
Because id is melted along with the other columns, the rows where variable is 'id' have to be dropped afterwards, but that should be easy. I don't know about performance on large data frames, though.
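Both answers return the rows grouped by column name, whereas the expected frame in the question lists all columns of one original row before moving to the next. A minimal sketch of one way to get that ordering, assuming pandas 1.1 or later (where melt accepts ignore_index), is to keep the original row index during the melt and then sort on it with a stable sort. The name result is chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6],
                    'D': ['e', 'f', 'g'], 'E': [7, 8, 9], 'id': [25, 15, 30]})

result = (
    df1.melt(id_vars='id', var_name='column', ignore_index=False)  # keep original row labels
       .sort_index(kind='mergesort')   # stable sort: rows grouped by source row, columns stay A..E
       .reset_index(drop=True)
       [['column', 'value', 'id']]
)
print(result)
```

The stable sort matters here: it preserves the A to E order that melt produced within each original row.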

Comparing 2 dataframes by ID

I am very new to Python. I want to compare two dataframes. They both have the same columns, first column is the key variable (ID). My goal is to print the differences.
For example:
import pandas as pd
import numpy as np
dframe1 = {'ID': [1, 2, 3, 4, 5], 'Apple': ['C', 'B', 'C', 'A', 'E'], 'Pear': [2, 3, 5, 6, 7]}
dframe2 = {'ID': [4, 2, 1, 3], 'Apple': ['A', 'C', 'C', 'C'], 'Pear': [6, 'NA', 'NA', 5]}
df1 = pd.DataFrame(dframe1)
df2 = pd.DataFrame(dframe2)
import datacompy
compare=datacompy.Compare(
df1,
df2,
df1_name='Reference',
df2_name='Test',
on_index=True
)
print(compare.report())
This produces a comparison report but I want my output to be like the following. Columns of my desired output:
out1 = {'var.x': ['Apple', 'Pear', 'Pear'], 'var.Y': ['Apple', 'Pear', 'Pear'], 'ID': [2, 1, 2],'values.x': ['B', '2', '3'], 'values.Y': ['C','NA','NA'],'row.x': [2, 1, 4], 'row.y': [2, 3, 1]}
outp = pd.DataFrame(out1)
print(outp)
Thanks a lot for your support.
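One possible approach with plain pandas, offered as a sketch rather than a definitive answer: reshape both frames to long form, join them on ID and variable name, and keep the pairs whose values differ. The helper to_long and the column names row, var and value are hypothetical, and IDs present in only one frame (ID 5 here) are dropped by the inner join.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5], 'Apple': ['C', 'B', 'C', 'A', 'E'],
                    'Pear': [2, 3, 5, 6, 7]})
df2 = pd.DataFrame({'ID': [4, 2, 1, 3], 'Apple': ['A', 'C', 'C', 'C'],
                    'Pear': [6, 'NA', 'NA', 5]})

def to_long(frame):
    # Record the 1-based row position, then turn every non-key column into rows.
    numbered = frame.assign(row=range(1, len(frame) + 1))
    return numbered.melt(id_vars=['ID', 'row'], var_name='var', value_name='value')

# Inner join on the key and the variable name; overlapping columns get .x / .y suffixes.
merged = to_long(df1).merge(to_long(df2), on=['ID', 'var'], suffixes=('.x', '.y'))

# Keep only the cells whose values differ between the two frames.
diff = merged[merged['value.x'] != merged['value.y']].reset_index(drop=True)
print(diff[['var', 'ID', 'value.x', 'value.y', 'row.x', 'row.y']])
```

Here the string 'NA' is compared as an ordinary value; real missing values (NaN) would need separate handling because NaN never compares equal to itself.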

Can I make 4 new columns aggregating 4 previous ones?

I have a data set like this:
data = ({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]})
Where Dan in A has the corresponding number 3 in B, and where Dan in C has the corresponding number 6 in D.
I would like to create 2 new columns, one with the name Dan and the other with 9 (3+6).
Desired Output
data = ({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12], 'E': ['Dan', 'Tom', 'Mary'], 'F': [9, 7, 9], 'G': ['John', 'Mike'], 'H': [1, 12]})
The names John and Mike, which each appear only once, should go into two separate columns (G and H) with their values unchanged.
I have tried using some for loops and .loc, but I am not anywhere close.
Thanks!
import pandas as pd
data = pd.DataFrame(data)  # the question's data is a plain dict
df = data[['A','B']]
_df = data[['C','D']]
_df.columns = ['A','B']
# as_index=False already returns A and B as regular columns, so no reset_index is needed
df = pd.concat([df,_df]).groupby(['A'],as_index=False)['B'].sum()
df.columns = ['E','F']
data = data.merge(df,how='left',left_on=['A'],right_on=['E'])
You could join on column C instead; that is something you have to choose. Alternatively, if you want just columns E and F, skip the last line.
You can try this:
import pandas as pd
data = {'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]}
df=pd.DataFrame(data)
df=df.rename(columns={"C": "A", "D": "B"})
df=df.stack().reset_index(0, drop=True).rename_axis("index").reset_index()
df=df.pivot(index=df.index//2, columns="index")
df.columns=map(lambda x: x[1], df.columns)
df=df.groupby("A", as_index=False).sum()
Outputs:
>>> df
A B
0 Dan 9
1 John 1
2 Mary 9
3 Mike 12
4 Tom 7

What is the 'name' in pandas.DataFrame.columns?

When I execute a pivot on a pandas dataframe,
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two','two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6],
'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df.pivot(index='foo', columns='bar', values='baz')
>>> bar A B C
foo
one 1 2 3
two 4 5 6
Which has these columns,
df.pivot(index='foo', columns='bar', values='baz').columns
>>> Index(['A', 'B', 'C'], dtype='object', name='bar')
My question is, what does the name='bar' part mean?
From the docs
name : object
Name to be stored in the index
In your example, it is the name of the pandas.Index that holds the column labels; pivot sets it to 'bar', the column the labels came from.
The name attribute becomes useful in some cases. For instance, if you have a MultiIndex, you can refer to a level of the index by its name:
>>> df
idx1 1 2 3 # <- column header 1
idx2 a b c # <- column header 2
vals 5 4 6
>>> df.columns
MultiIndex(levels=[[1, 2, 3], ['a', 'b', 'c']],
labels=[[0, 1, 2], [0, 1, 2]],
names=['idx1', 'idx2'])
>>> df.columns.get_level_values('idx1')
Int64Index([1, 2, 3], dtype='int64', name='idx1')
>>> df.columns.get_level_values('idx2')
Index(['a', 'b', 'c'], dtype='object', name='idx2')
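A short sketch of how the name can be inspected, changed, or removed on the pivot result from the question. The variable names pivoted, renamed and cleaned, and the label letter, are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})

pivoted = df.pivot(index='foo', columns='bar', values='baz')
print(pivoted.columns.name)   # name of the column Index, taken from the pivoted column
print(pivoted.index.name)     # the row Index has a name too

# Give the column Index a different name, or drop it entirely.
renamed = pivoted.rename_axis('letter', axis=1)
cleaned = pivoted.rename_axis(None, axis=1)
print(cleaned)
```

Removing the name only changes how the frame is labelled when printed; the data and the column labels stay the same.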
