I have a dataframe like the one shown below:
import pandas as pd

tdf = pd.DataFrame(
    [['Region', 'Name', 'target_achieved', 'target_set', 'score',
      'target_achieved', 'target_set', 'score'],
     ['Asean', 'DEF', 2345, 3000, 86, 245, 300, 76],
     ['Asean', 'GHI', 5678, 6000, 55, 578, 600, 45],
     ['Asean', 'JKL', 7890, 8000, 90, 790, 800, 70],
     ['Asean', 'MNO', 1234, 1500, 65, 123, 150, 55],
     ['Asean', 'PQR', 6789, 7000, 90, 689, 700, 60],
     ['Asean', 'STU', 5454, 5500, 87, 454, 500, 77]],
    columns=['Unnamed: 0', 'Unnamed: 1',
             '2017Q1', '2017Q1', '2017Q1',
             '2017Q2', '2017Q2', '2017Q2'])
As you can see, my column names are duplicated: there are three 2017Q1 columns and three 2017Q2 columns. (Note that a Python dict cannot hold duplicate keys, so building this frame from a dict would silently keep only the last value for each repeated name; that is why the sample above passes an explicit columns list.)
I tried the below to get my expected output:
tdf.columns = tdf.iloc[0]  # but this still ignores the columns with duplicate names
Update
After reading the Excel file, based on jezrael's answer, I get the display below.
I expect my output to look as shown below.
First create a MultiIndex in both the columns and the index:
df = pd.read_excel(file, header=[0,1], index_col=[0,1])
If that is not possible, here is an alternative using your sample data: convert the original columns plus the first row of data to a MultiIndex in the columns, and the first two columns to a MultiIndex in the index:
tdf = pd.read_excel(file)
# pair the original header with the first data row to build a two-level header
tdf.columns = pd.MultiIndex.from_arrays([tdf.columns, tdf.iloc[0]])
# drop the header row and move the first two columns into the index
df = (tdf.iloc[1:]
        .set_index(tdf.columns[:2].tolist())
        .rename_axis(index=['Region','Name'], columns=['Year',None]))
print (df.index)
MultiIndex([('Asean', 'DEF'),
('Asean', 'GHI'),
('Asean', 'JKL'),
('Asean', 'MNO'),
('Asean', 'PQR'),
('Asean', 'STU')],
names=['Region', 'Name'])
print (df.columns)
MultiIndex([('2017Q1', 'target_achieved'),
('2017Q1', 'target_set'),
('2017Q1', 'score'),
('2017Q2', 'target_achieved'),
('2017Q2', 'target_set'),
('2017Q2', 'score')],
names=['Year', None])
And then reshape:
df1 = df.stack(0).reset_index()
print (df1)
Region Name Year score target_achieved target_set
0 Asean DEF 2017Q1 86 2345 3000
1 Asean DEF 2017Q2 76 245 300
2 Asean GHI 2017Q1 55 5678 6000
3 Asean GHI 2017Q2 45 578 600
4 Asean JKL 2017Q1 90 7890 8000
5 Asean JKL 2017Q2 70 790 800
6 Asean MNO 2017Q1 65 1234 1500
7 Asean MNO 2017Q2 55 123 150
8 Asean PQR 2017Q1 90 6789 7000
9 Asean PQR 2017Q2 60 689 700
10 Asean STU 2017Q1 87 5454 5500
11 Asean STU 2017Q2 77 454 500
EDIT: The solution for the edited question is similar:
df = pd.read_excel(file, header=[0,1], index_col=[0,1])
df1 = df.rename_axis(index=['Region','Name'], columns=['Year',None]).stack(0).reset_index()
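Note: on pandas 2.1+, stack emits a FutureWarning about its reworked implementation; a minimal tweak (assuming pandas >= 2.1) keeps the same values, though row order may differ:
df1 = df.stack(0, future_stack=True).reset_index()  # opt in to the new stack behaviour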
Related
Trying to melt the following data such that the result is 4 rows: one for "John-1" containing his before data, one for "John-2" containing his after data, one for "Kelly-1" containing her before data, and one for "Kelly-2" containing her after data. The columns would be "Name", "Weight", and "Height". Can this be done solely with the melt function?
df = pd.DataFrame({'Name': ['John', 'Kelly'],
'Weight Before': [200, 175],
'Weight After': [195, 165],
'Height Before': [6, 5],
'Height After': [7, 6]})
Use the pandas.wide_to_long function as shown below:
# positional arguments: stubnames=['Weight', 'Height'], i='Name', j='grp', sep=' ', suffix='\\w+'
pd.wide_to_long(df, ['Weight', 'Height'], 'Name', 'grp', ' ', '\\w+').reset_index()
Name grp Weight Height
0 John Before 200 6
1 Kelly Before 175 5
2 John After 195 7
3 Kelly After 165 6
Or you could use pivot_longer from pyjanitor as follows:
import janitor
df.pivot_longer('Name', names_to = ['.value', 'grp'], names_sep = ' ')
Name grp Weight Height
0 John Before 200 6
1 Kelly Before 175 5
2 John After 195 7
3 Kelly After 165 6
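As for the literal question: melt alone cannot split a header like "Weight Before" into two fields, so a melt-based approach needs a follow-up step. A minimal sketch (my own variant, assuming pandas >= 1.1 for the list-valued pivot index):
long = df.melt(id_vars='Name')
# split "Weight Before" into the measure ("Weight") and the group ("Before")
long[['measure', 'grp']] = long['variable'].str.split(' ', expand=True)
out = (long.pivot(index=['Name', 'grp'], columns='measure', values='value')
           .reset_index())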
I have the following data frame:
EID      CLEAN_NAME  Start_Date  End_Date    Rank_no
E000000  DEF         3/1/1973    2/28/1978   154
E000001  GHI         6/1/1983    3/31/1988   1296
E000001  ABC         1/1/2017                80292
E000002  JKL         10/1/1980   8/31/1981   751.5
E000003  MNO         5/1/1973    11/30/1977  157
E000003  ABC         5/1/1977    11/30/1987  200
E000003  PQR         5/1/1987    11/30/1997  300
E000003  ABC         5/1/1997                1000
What I am trying to do here is delete, for each EID, the ABC row that has the highest value in the Rank_no column among that EID's ABC records. If an ABC record is not the highest-ranked ABC for its EID, it should not be deleted. The rest of the data should remain as it is. The expected output is as follows:
EID      CLEAN_NAME  Start_Date  End_Date    Rank_no
E000000  DEF         3/1/1973    2/28/1978   154
E000001  GHI         6/1/1983    3/31/1988   1296
E000002  JKL         10/1/1980   8/31/1981   751.5
E000003  MNO         5/1/1973    11/30/1977  157
E000003  ABC         5/1/1977    11/30/1987  200
E000003  PQR         5/1/1987    11/30/1997  300
I tried to use the following code:
result_new = result.drop(result[(result['Rank_no'] == result.Rank_no.max()) & (result['CLEAN_NAME'] == 'ABC')].index)
But it's not working. I am pretty sure I am giving the conditions incorrectly, but I am not sure what exactly I am missing or writing wrong. I have named my data frame result.
Any leads would be appreciated. Thanks!
Use groupby and idxmax to find, after filtering down to only the ABC rows, the index of the maximum Rank_no for each EID, then drop those rows:
df.drop(df.loc[df.CLEAN_NAME == "ABC"].groupby("EID").Rank_no.idxmax())
EID CLEAN_NAME Start_Date End_Date Rank_no
0 E000000 DEF 3/1/1973 2/28/1978 154.0
1 E000001 GHI 6/1/1983 3/31/1988 1296.0
3 E000002 JKL 10/1/1980 8/31/1981 751.5
4 E000003 MNO 5/1/1973 11/30/1977 157.0
5 E000003 ABC 5/1/1977 11/30/1987 200.0
6 E000003 PQR 5/1/1987 11/30/1997 300.0
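The same logic spelled out in steps (equivalent to the one-liner above; note that idxmax returns the label of the first maximum, so with tied top ranks only one ABC row per EID is dropped):
abc = df[df.CLEAN_NAME == "ABC"]               # only the ABC rows
top_abc = abc.groupby("EID").Rank_no.idxmax()  # index label of the highest-ranked ABC per EID
out = df.drop(top_abc)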
import pandas as pd
datas = [
['E000000', 'DEF', '3/1/1973', '2/28/1978', 154],
['E000001', 'GHI', '6/1/1983', '3/31/1988', 1296],
['E000001', 'ABC', '1/1/2017', '', 80292],
['E000002', 'JKL', '10/1/1980', '8/31/1981', 751.5],
['E000003', 'MNO', '5/1/1973', '11/30/1977', 157],
['E000003', 'ABC', '5/1/1977', '11/30/1987', 200],
['E000003', 'PQR', '5/1/1987', '11/30/1997', 300],
['E000003', 'ABC', '5/1/1997', '', 1000],
]
result = pd.DataFrame(datas, columns=['EID', 'CLEAN_NAME', 'Start_Date', 'End_Date', 'Rank_no'])
new_result = result.sort_values(by='Rank_no') # sort by lowest Rank_no
new_result = new_result.drop_duplicates(subset=['CLEAN_NAME'], keep='first') # drop duplicates keeping the first
new_result = new_result.sort_values(by='EID') # sort by EID
print(new_result)
Output:
EID CLEAN_NAME Start_Date End_Date Rank_no
0 E000000 DEF 3/1/1973 2/28/1978 154.0
1 E000001 GHI 6/1/1983 3/31/1988 1296.0
3 E000002 JKL 10/1/1980 8/31/1981 751.5
4 E000003 MNO 5/1/1973 11/30/1977 157.0
5 E000003 ABC 5/1/1977 11/30/1987 200.0
6 E000003 PQR 5/1/1987 11/30/1997 300.0
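One caveat with this approach: drop_duplicates(subset=['CLEAN_NAME']) de-duplicates every name across all EIDs, not just ABC within each EID, so it matches the expected output on this sample but could drop the wrong rows on data where the same name repeats across EIDs with different ranks. The groupby/idxmax answer above does not have that limitation.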
Hi, I have df_input where only the column names need to be sorted, not the rows (restructuring the dataframe):
df_input.columns
Out[143]: Index(['product_name', 'price', 'make', 'v_d1', 'v_d4', 'v_d2', 'v_d3'], dtype='object')
My required output column names should be sorted after the first N columns (here, after 3 columns):
df_out.columns
Out[144]: Index(['product_name', 'price', 'make', 'v_d1', 'v_d2', 'v_d3', 'v_d4'], dtype='object')
My input dataframe is as follows:
data = {'product_name': ['laptop', 'printer', 'tablet', 'desktop', 'chair'],
'price': [1200, 150, 300, 450, 200],
'make':['Dell','hp','Lenove','iPhone','xyz'],
'v_d1':[2,44,55,2,1],
'v_d4':[66,12,55,7,89],
'v_d2':[54,12,45,77,23],
'v_d3':[88,69,37,15,10]
}
df_input = pd.DataFrame(data)
print (df_input)
Required output dataframe:
data = {'product_name': ['laptop', 'printer', 'tablet', 'desktop', 'chair'],
'price': [1200, 150, 300, 450, 200],
'make':['Dell','hp','Lenove','iPhone','xyz'],
'v_d1':[2,44,55,2,1],
'v_d2':[54,12,45,77,23],
'v_d3':[88,69,37,15,10],
'v_d4':[66,12,55,7,89]
}
df_out = pd.DataFrame(data)
Thanks in advance
If the numeric suffixes of the column names are single digits (0 to 9), lexicographic order is enough, so you can use sorted on the sliced columns:
df = df[df.columns[:3].tolist() + sorted(df.columns[3:])]
print (df)
product_name price make v_d1 v_d2 v_d3 v_d4
0 laptop 1200 Dell 2 54 88 66
1 printer 150 hp 44 12 69 12
2 tablet 300 Lenove 55 45 37 55
3 desktop 450 iPhone 2 77 15 7
4 chair 200 xyz 1 23 10 89
More general solution with natural sorting:
from natsort import natsorted
data = {'product_name': ['laptop', 'printer', 'tablet', 'desktop', 'chair'],
'price': [1200, 150, 300, 450, 200],
'make':['Dell','hp','Lenove','iPhone','xyz'],
'v_d1':[2,44,55,2,1],
'v_d4':[66,12,55,7,89],
'v_d10':[54,12,45,77,23],
'v_d20':[88,69,37,15,10]
}
df = pd.DataFrame(data)
df = df[df.columns[:3].tolist() + natsorted(df.columns[3:])]
print (df)
product_name price make v_d1 v_d4 v_d10 v_d20
0 laptop 1200 Dell 2 66 54 88
1 printer 150 hp 44 12 12 69
2 tablet 300 Lenove 55 55 45 37
3 desktop 450 iPhone 2 7 77 15
4 chair 200 xyz 1 89 23 10
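If you would rather avoid the natsort dependency, here is a minimal sketch using a numeric sort key (my own variant, assuming every column after the first three ends in digits):
import re

# sort the variable columns by their trailing number instead of lexicographically
fixed = df.columns[:3].tolist()
variable = sorted(df.columns[3:], key=lambda c: int(re.search(r'\d+$', c).group()))
df = df[fixed + variable]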
I have the dataset below. How can I create a new column that shows the difference in money for each person, for each expiry?
The column in yellow is what I want: it is the difference in money at each expiry point for each person. I highlighted the other rows in colors so it is clearer.
Thanks a lot.
Example
import pandas as pd
import numpy as np
example = pd.DataFrame( data = {'Day': ['2020-08-30', '2020-08-30','2020-08-30','2020-08-30',
'2020-08-29', '2020-08-29','2020-08-29','2020-08-29'],
'Name': ['John', 'Mike', 'John', 'Mike','John', 'Mike', 'John', 'Mike'],
'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
'Expiry': ['1Y', '1Y', '2Y','2Y','1Y','1Y','2Y','2Y']})
# split the data into the two days being compared
example_0830 = example[ example['Day']=='2020-08-30' ].reset_index()
example_0829 = example[ example['Day']=='2020-08-29' ].reset_index()
# build a join key from Name + Expiry
example_0830['key'] = example_0830['Name'] + example_0830['Expiry']
example_0829['key'] = example_0829['Name'] + example_0829['Expiry']
example_0829 = pd.DataFrame( example_0829, columns = ['key','Money'])
# line up each (Name, Expiry) pair across the two days and take the difference
example_0830 = pd.merge(example_0830, example_0829, on = 'key')
example_0830['Difference'] = example_0830['Money_x'] - example_0830['Money_y']
example_0830 = example_0830.drop(columns=['key', 'Money_y','index'])
Result:
Day Name Money_x Expiry Difference
0 2020-08-30 John 100 1Y 50
1 2020-08-30 Mike 950 1Y 900
2 2020-08-30 John 200 2Y -50
3 2020-08-30 Mike 1000 2Y -200
If the difference is just derived from the previous date, you can define a date variable at the beginning to find today (t) and the previous day (t-1) and use those to filter the original dataframe, as sketched below.
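A minimal sketch of that idea, reusing the example frame from above (the variable names here are illustrative):
# pick out the two most recent days
days = sorted(example['Day'].unique(), reverse=True)
today, prev = days[0], days[1]

current = example[example['Day'] == today]
previous = example[example['Day'] == prev][['Name', 'Expiry', 'Money']]

# line up each (Name, Expiry) pair across the two days and take the difference
out = current.merge(previous, on=['Name', 'Expiry'], suffixes=('', '_prev'))
out['Difference'] = out['Money'] - out['Money_prev']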
You can solve it with groupby.diff
Take the dataframe
df = pd.DataFrame({
'Day': [30, 30, 30, 30, 29, 29, 28, 28],
'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
'Expiry': [1, 1, 2, 2, 1, 1, 2, 2]
})
print(df)
Which looks like
Day Name Money Expiry
0 30 John 100 1
1 30 Mike 950 1
2 30 John 200 2
3 30 Mike 1000 2
4 29 John 50 1
5 29 Mike 50 1
6 28 John 250 2
7 28 Mike 1200 2
And the code
# make sure the dates are in descending order so each row's "next" row is the previous day
df = df.sort_values('Day', ascending=False)
# groubpy and get the difference from the next row in each group
# diff(1) calculates the difference from the previous row, so -1 will point to the next
df['Difference'] = df.groupby(['Name', 'Expiry']).Money.diff(-1)
Output
Day Name Money Expiry Difference
0 30 John 100 1 50.0
1 30 Mike 950 1 900.0
2 30 John 200 2 -50.0
3 30 Mike 1000 2 -200.0
4 29 John 50 1 NaN
5 29 Mike 50 1 NaN
6 28 John 250 2 NaN
7 28 Mike 1200 2 NaN
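Because diff(-1) subtracts the next row within each (Name, Expiry) group, the oldest day in each group has nothing to compare against, which is why rows 4 to 7 come out as NaN.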
I am kind of new to data frames and have been trying to find a solution for shifting data in specific columns to another row when ID and CODE match.
df = pd.DataFrame({'ID': ['123', '123', '123', '154', '167', '167'],
'NAME': ['Adam', 'Adam', 'Adam', 'Bob', 'Charlie', 'Charlie'],
'CODE': ['1001', '1001', '1011', '1002', 'A0101', 'A0101'],
'TAG1': ['A123', 'B123', 'K123', 'D123', 'E123', 'G123'],
'TAG2': [ np.NaN, 'C123', 'L123', np.NaN, 'F123', 'H123'],
'TAG3': [ np.NaN, 'M123', np.NaN, np.NaN, np.NaN, np.NaN]})
ID NAME CODE TAG1 TAG2 TAG3
0 123 Adam 1001 A123 NaN NaN
1 123 Adam 1001 B123 C123 M123
2 123 Adam 1011 K123 L123 NaN
3 154 Bob 1002 D123 NaN NaN
4 167 Charlie A0101 E123 F123 NaN
5 167 Charlie A0101 G123 H123 NaN
So above I have added the code which depicts the initial data frame. We can see that ID='123' has two rows with the same code, and the values in the TAG columns vary. I would like to shift the TAG data in the second row to the first row after TAG1 (or into TAG1 if it is empty) and delete the second row completely. It should be the same for ID='167'.
Below is another code block with a sample data frame I added manually, which depicts the final result; any suggestions would be great. I tried one thing, but it did not work the way I wanted it to.
df_result = pd.DataFrame({'ID': ['123', '123', '154', '167'],
'NAME': ['Adam', 'Adam', 'Bob', 'Charlie',],
'CODE': ['1001', '1011', '1002', 'A0101'],
'TAG1': ['A123', 'K123', 'D123', 'E123'],
'TAG2': ['B123', 'L123', np.NaN, 'F123'],
'TAG3': ['C123', np.NaN, np.NaN, 'G123'],
'TAG4': ['M123', np.NaN, np.NaN, 'H123']})
ID NAME CODE TAG1 TAG2 TAG3 TAG4
0 123 Adam 1001 A123 B123 C123 M123
1 123 Adam 1011 K123 L123 NaN NaN
2 154 Bob 1002 D123 NaN NaN NaN
3 167 Charlie A0101 E123 F123 G123 H123
The code that I tried in order to get close to the result is below, but it did not produce exactly the output I wanted:
df2=pd.pivot_table(test_shift, index=['ID', 'NAME', 'CODE'],
columns=test_shift.groupby(['ID', 'CODE']).cumcount().add(1),
values=['TAG1'], aggfunc='sum')
(output image omitted)
NOTE: Sorry for the bad posting the first time; I tried adding the dataframe as code to help you visually, but I failed. I will try to learn that over the coming days and be a better member of Stack Overflow.
Thank you for the help.
Here is one way:
def f(x):
    # stack the TAG columns into one long Series of values
    x = x.stack().to_frame(name='val')
    # renumber the tags consecutively within the group: Tag1, Tag2, ...
    x = x.assign(tags='Tag'+x["val"].notna().cumsum().astype(str))
    # drop the original TAG level, then spread the renumbered tags back into columns
    x = (x.reset_index(level=3, drop=True)
         .set_index('tags', append=True)['val'].unstack())
    return x

df_out = (df.set_index(['ID', 'NAME','CODE'])
          .groupby(level=[0,1,2], as_index=False).apply(f)
          .reset_index().drop('level_0', axis=1))
print(df_out)
Output:
ID NAME CODE Tag1 Tag2 Tag3 Tag4
0 123 Adam 1001 A123 B123 C123 M123
1 123 Adam 1011 K123 L123 NaN NaN
2 154 Bob 1002 D123 NaN NaN NaN
3 167 Charlie A0101 E123 F123 G123 H123
An approach with pd.wide_to_long: flatten the TAG columns into long form, map the long index onto the (ID, CODE) groups, then unstack and join the result back onto the de-duplicated ID, NAME and CODE rows:
# flatten TAG1..TAG3 to long form, keyed by the original row number
u = pd.wide_to_long(df.filter(like='TAG').reset_index(),'TAG','index','j').dropna()
# map each original row to its (ID, CODE) group number
idx = u.index.get_level_values(0).map(df.groupby(['ID','CODE']).ngroup())
# renumber tags within each group, then spread them back out to columns
u = u.set_index(pd.MultiIndex.from_arrays((idx,u.groupby(idx).cumcount()+1))).unstack()
u.columns = u.columns.map('{0[0]}{0[1]}'.format)
# stitch the tags back onto the de-duplicated key columns
out = df[['ID','NAME','CODE']].drop_duplicates().reset_index(drop=True).join(u)
print(out)
ID NAME CODE TAG1 TAG2 TAG3 TAG4
0 123 Adam 1001 A123 B123 C123 M123
1 123 Adam 1011 K123 L123 NaN NaN
2 154 Bob 1002 D123 NaN NaN NaN
3 167 Charlie A0101 E123 F123 G123 H123
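For comparison, a plainer sketch of the same reshape (my own variant, not from either answer; on pandas 2.2+ apply may warn about operating on the grouping columns): collect the non-null tags per (ID, NAME, CODE) group and let pandas align the uneven results into TAG1..TAGn columns.
def collect_tags(g):
    # gather every non-null tag in row order, then relabel as TAG1, TAG2, ...
    tags = g[['TAG1', 'TAG2', 'TAG3']].stack().tolist()
    return pd.Series(tags, index=[f'TAG{i + 1}' for i in range(len(tags))])

out = (df.groupby(['ID', 'NAME', 'CODE'], sort=False)
         .apply(collect_tags)
         .reset_index())
print(out)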