sort columns names only not rows in python


Hi, I have a dataframe df_input whose columns need to be reordered by name. The rows should not be sorted; this only restructures the dataframe.
df_input.columns
Out[143]: Index(['product_name', 'price', 'make', 'v_d1', 'v_d4', 'v_d2', 'v_d3'], dtype='object')
In the required output, the column names should be sorted only after the first N columns (here, after the first 3 columns):
df_out.columns
Out[144]: Index(['product_name', 'price', 'make', 'v_d1', 'v_d2', 'v_d3', 'v_d4'], dtype='object')
My input dataframe is as follows:
data = {'product_name': ['laptop', 'printer', 'tablet', 'desktop', 'chair'],
'price': [1200, 150, 300, 450, 200],
'make':['Dell','hp','Lenove','iPhone','xyz'],
'v_d1':[2,44,55,2,1],
'v_d4':[66,12,55,7,89],
'v_d2':[54,12,45,77,23],
'v_d3':[88,69,37,15,10]
}
df_input = pd.DataFrame(data)
print(df_input)
Required output dataframe:
data = {'product_name': ['laptop', 'printer', 'tablet', 'desktop', 'chair'],
'price': [1200, 150, 300, 450, 200],
'make':['Dell','hp','Lenove','iPhone','xyz'],
'v_d1':[2,44,55,2,1],
'v_d2':[54,12,45,77,23],
'v_d3':[88,69,37,15,10],
'v_d4':[66,12,55,7,89]
}
df_out = pd.DataFrame(data)
Thanks in advance

If the numbers in the column names are single digits (0 to 9), plain alphabetical sorting gives the right order, so you can use sorted on a slice of the columns:
df = df[df.columns[:3].tolist() + sorted(df.columns[3:])]
print (df)
product_name price make v_d1 v_d2 v_d3 v_d4
0 laptop 1200 Dell 2 54 88 66
1 printer 150 hp 44 12 69 12
2 tablet 300 Lenove 55 45 37 55
3 desktop 450 iPhone 2 77 15 7
4 chair 200 xyz 1 23 10 89
A more general solution uses natural sorting, so that v_d10 comes after v_d4 (natsort is a third-party package):
from natsort import natsorted
data = {'product_name': ['laptop', 'printer', 'tablet', 'desktop', 'chair'],
'price': [1200, 150, 300, 450, 200],
'make':['Dell','hp','Lenove','iPhone','xyz'],
'v_d1':[2,44,55,2,1],
'v_d4':[66,12,55,7,89],
'v_d10':[54,12,45,77,23],
'v_d20':[88,69,37,15,10]
}
df = pd.DataFrame(data)
df = df[df.columns[:3].tolist() + natsorted(df.columns[3:])]
print (df)
product_name price make v_d1 v_d4 v_d10 v_d20
0 laptop 1200 Dell 2 66 54 88
1 printer 150 hp 44 12 12 69
2 tablet 300 Lenove 55 55 45 37
3 desktop 450 iPhone 2 7 77 15
4 chair 200 xyz 1 89 23 10
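If installing natsort is not an option, the same ordering can be sketched with the standard library only. This is a minimal sketch, not part of the original answer: the helper name natural_key and the variable n_fixed are made up for illustration, and the key simply splits each name into text and digit runs so that digit runs compare as integers.

```python
import re

import pandas as pd


def natural_key(name):
    # Hypothetical helper: split "v_d10" into ["v_d", 10, ""] so that
    # digit runs are compared numerically instead of character by character.
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


df = pd.DataFrame(
    columns=["product_name", "price", "make", "v_d20", "v_d1", "v_d10", "v_d4"]
)

n_fixed = 3  # number of leading columns to keep in place (assumption: 3, as in the question)
new_order = list(df.columns[:n_fixed]) + sorted(df.columns[n_fixed:], key=natural_key)
result = df[new_order]
print(list(result.columns))
```

Only the column labels are reordered; the row order and the values in each column are unchanged.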

Related

Pandas .aggregate on one dataframe, but different functions for each column [duplicate]

I have a dataset that contains NBA players' average statistics per game. Some players' statistics are repeated because they played for different teams during the season.
For example:
Player Pos Age Tm G GS MP FG
8 Jarrett Allen C 22 TOT 28 10 26.2 4.4
9 Jarrett Allen C 22 BRK 12 5 26.7 3.7
10 Jarrett Allen C 22 CLE 16 5 25.9 4.9
I want to average Jarrett Allen's stats and put them into a single row. How can I do this?

You can groupby and use agg to get the mean. For the non numeric columns, let's take the first value:
df.groupby('Player').agg({k: 'mean' if v in ('int64', 'float64') else 'first'
for k,v in df.dtypes[1:].items()})
output:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22 TOT 18.666667 6.666667 26.266667 4.333333
NB. content of the dictionary comprehension:
{'Pos': 'first',
'Age': 'mean',
'Tm': 'first',
'G': 'mean',
'GS': 'mean',
'MP': 'mean',
'FG': 'mean'}
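The same idea can be tried end to end on the three rows from the question. This is a minimal sketch that swaps the dtype-name comparison for pandas.api.types.is_numeric_dtype, which also covers other numeric dtypes such as int32; the names agg_map and summary are chosen here for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# The three Jarrett Allen rows shown in the question.
df = pd.DataFrame({
    "Player": ["Jarrett Allen"] * 3,
    "Pos": ["C", "C", "C"],
    "Age": [22, 22, 22],
    "Tm": ["TOT", "BRK", "CLE"],
    "G": [28, 12, 16],
    "GS": [10, 5, 5],
    "MP": [26.2, 26.7, 25.9],
    "FG": [4.4, 3.7, 4.9],
})

# Numeric columns are averaged; every other column keeps its first value.
agg_map = {
    col: "mean" if is_numeric_dtype(df[col]) else "first"
    for col in df.columns.drop("Player")
}
summary = df.groupby("Player").agg(agg_map)
print(summary)
```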

import pandas as pd
x = [['a', 12, 5],['a', 12, 7], ['b', 15, 10],['b', 15, 12],['c', 20, 1]]
df = pd.DataFrame(x, columns=['name', 'age', 'score'])
print(df)
print('-----------')
df2 = df.groupby(['name', 'age']).mean()
print(df2)
Output:
name age score
0 a 12 5
1 a 12 7
2 b 15 10
3 b 15 12
4 c 20 1
-----------
score
name age
a 12 6
b 15 11
c 20 1

Option 1
Taking df to be the dataframe that the OP shares in the question, the following will do the work:
df_new = df.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22.0 TOT 18.666667 6.666667 26.266667 4.333333
This one uses:
pandas.DataFrame.groupby to group by the Player column.
pandas.core.groupby.DataFrameGroupBy.agg to aggregate the values with a custom lambda function.
pandas.api.types.is_string_dtype to check whether a column is of string type.
Let's test it with a new dataframe, df2, with more elements in the Player column.
import numpy as np
df2 = pd.DataFrame({'Player': ['John Collins', 'John Collins', 'John Collins', 'Trae Young', 'Trae Young', 'Clint Capela', 'Jarrett Allen', 'Jarrett Allen', 'Jarrett Allen'],
'Pos': ['PF', 'PF', 'PF', 'PG', 'PG', 'C', 'C', 'C', 'C'],
'Age': np.random.randint(0, 100, 9),
'Tm': ['ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'TOT', 'BRK', 'CLE'],
'G': np.random.randint(0, 100, 9),
'GS': np.random.randint(0, 100, 9),
'MP': np.random.uniform(0, 100, 9),
'FG': np.random.uniform(0, 100, 9)})
[Out]:
Player Pos Age Tm G GS MP FG
0 John Collins PF 71 ATL 75 39 16.123225 77.949756
1 John Collins PF 60 ATL 49 49 30.308092 24.788401
2 John Collins PF 52 ATL 33 92 11.087317 58.488575
3 Trae Young PG 72 ATL 20 91 62.862313 60.169282
4 Trae Young PG 85 ATL 61 77 30.248551 85.169038
5 Clint Capela C 73 ATL 5 67 45.817690 21.966777
6 Jarrett Allen C 23 TOT 60 51 93.076624 34.160823
7 Jarrett Allen C 12 BRK 2 77 74.318568 78.755869
8 Jarrett Allen C 44 CLE 82 81 7.375631 40.930844
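Running the same operation on df2 gives output like the following. Because df2 is built from unseeded random draws, the numbers change on every run, and the figures below come from a different draw than the table printed above: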
df_new2 = df2.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Clint Capela C 95.000000 ATL 30.000000 98.000000 46.476398 17.987104
Jarrett Allen C 60.000000 TOT 48.666667 19.333333 70.050540 33.572896
John Collins PF 74.333333 ATL 50.333333 52.666667 78.181457 78.152235
Trae Young PG 57.500000 ATL 44.500000 47.500000 46.602543 53.835455
Option 2
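Depending on the desired output, if one only needs the numeric columns averaged per player (ignoring Pos and Tm), a simpler solution is to group by Player and call .mean(). On pandas 2.0 and later, pass numeric_only=True, because the string columns otherwise raise a TypeError:
df_new3 = df.groupby('Player').mean(numeric_only=True)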
[Out]:
Age G GS MP FG
Player
Jarrett Allen 22.0 18.666667 6.666667 26.266667 4.333333
Note:
The output of this operation contains only the numeric columns. Pos and Tm are dropped, and Player becomes the index.


How to retain duplicate column names and melt dataframe using pandas?

I have a dataframe like as shown below
tdf = pd.DataFrame(
{'Unnamed: 0' : ['Region','Asean','Asean','Asean','Asean','Asean','Asean'],
'Unnamed: 1' : ['Name', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR','STU'],
'2017Q1' : ['target_achieved',2345,5678,7890,1234,6789,5454],
'2017Q1' : ['target_set', 3000,6000,8000,1500,7000,5500],
'2017Q1' : ['score', 86, 55, 90, 65, 90, 87],
'2017Q2' : ['target_achieved',245,578,790,123,689,454],
'2017Q2' : ['target_set', 300,600,800,150,700,500],
'2017Q2' : ['score', 76, 45, 70, 55, 60, 77]})
As you can see, my column names are duplicated: there are three columns named 2017Q1 and three named 2017Q2.
A dataframe built from a dict cannot keep these duplicates, because each repeated key overwrites the previous one.
I tried the below to get my expected output
tdf.columns = tdf.iloc[0]  # but this still ignores the columns with duplicate names
Update:
After reading the Excel file as suggested in jezrael's answer, I get the display below (screenshot not preserved).
I expect my output to look like the one shown below (screenshot not preserved).

First, read the file so that both the columns and the index become a MultiIndex:
df = pd.read_excel(file, header=[0,1], index_col=[0,1])
If that is not possible, here is an alternative based on your sample data. It combines the column names and the first row of data into a MultiIndex for the columns, then moves the first two columns into a MultiIndex for the index:
tdf = pd.read_excel(file)
tdf.columns = pd.MultiIndex.from_arrays([tdf.columns, tdf.iloc[0]])
df = (tdf.iloc[1:]
.set_index(tdf.columns[:2].tolist())
.rename_axis(index=['Region','Name'], columns=['Year',None]))
print (df.index)
MultiIndex([('Asean', 'DEF'),
('Asean', 'GHI'),
('Asean', 'JKL'),
('Asean', 'MNO'),
('Asean', 'PQR'),
('Asean', 'STU')],
names=['Region', 'Name'])
print (df.columns)
MultiIndex([('2017Q1', 'target_achieved'),
('2017Q1', 'target_set'),
('2017Q1', 'score'),
('2017Q2', 'target_achieved'),
('2017Q2', 'target_set'),
('2017Q2', 'score')],
names=['Year', None])
And then reshape:
df1 = df.stack(0).reset_index()
print (df1)
Region Name Year score target_achieved target_set
0 Asean DEF 2017Q1 86 2345 3000
1 Asean DEF 2017Q2 76 245 300
2 Asean GHI 2017Q1 55 5678 6000
3 Asean GHI 2017Q2 45 578 600
4 Asean JKL 2017Q1 90 7890 8000
5 Asean JKL 2017Q2 70 790 800
6 Asean MNO 2017Q1 65 1234 1500
7 Asean MNO 2017Q2 55 123 150
8 Asean PQR 2017Q1 90 6789 7000
9 Asean PQR 2017Q2 60 689 700
10 Asean STU 2017Q1 87 5454 5500
11 Asean STU 2017Q2 77 454 500
EDIT: The solution for the edited question is similar:
df = pd.read_excel(file, header=[0,1], index_col=[0,1])
df1 = df.rename_axis(index=['Region','Name'], columns=['Year',None]).stack(0).reset_index()
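The reshape step can be tried without an Excel file. The sketch below is an illustration under assumptions: it builds the two-level columns and index directly in memory (only two of the six rows from the sample, with the names wide and long_df chosen here), then applies the same stack(0) and reset_index calls. The order of the value columns in the result may differ between pandas versions, and recent versions may emit a FutureWarning about the stack implementation.

```python
import pandas as pd

# Two-level column header: quarter on top, measure underneath.
columns = pd.MultiIndex.from_product(
    [["2017Q1", "2017Q2"], ["target_achieved", "target_set", "score"]],
    names=["Year", None],
)
# Two-level row index: Region and Name.
index = pd.MultiIndex.from_tuples(
    [("Asean", "DEF"), ("Asean", "GHI")],
    names=["Region", "Name"],
)
wide = pd.DataFrame(
    [[2345, 3000, 86, 245, 300, 76],
     [5678, 6000, 55, 578, 600, 45]],
    index=index,
    columns=columns,
)

# Move the Year level from the columns into the rows, then flatten the index.
long_df = wide.stack(0).reset_index()
print(long_df)
```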

One to many join in python with zero fill in repeated record created due to one to many join

I have two pandas dataframes, df1 and df2. The relationship between them is one to many, and I need 0 instead of the repeated value from the table on the "one" side of the relationship. Here is a sample of my two dataframes and the dataframe after merging.
df1 looks like
Class Section ID Subject Score
I A 12 Maths 70
I A 12 Chemistry 85
I A 12 Physics 75
I A 16 Maths 70
I A 16 Chemistry 85
I A 16 Physics 75
I A 16 Arts 65
I B 14 Arts 60
and df2 looks like
Class Section ID Subject Score
I A 12 Total 230
I A 16 Total 230
I A 16 Total 65
I B 14 Total 65
I would like to join these two tables on the matching columns Class, Section and ID, and I need the final table to look like this after joining:
Class Section ID Subject Score Total
I A 12 Maths 70 230
I A 12 Chemistry 85 0
I A 12 Physics 75 0
I A 16 Maths 70 230
I A 16 Chemistry 85 65
I A 16 Physics 75 0
I A 16 Arts 65 0
I B 14 Arts 60 60
Can you suggest how I should do this using Python 3.x?

A very late answer, but each group can be enumerated with groupby cumcount then the enumeration can be used for merge:
cols = ['Class', 'Section', 'ID']
df3 = (
df1.merge(df2.drop('Subject', axis=1) # Remove unneeded column from df2
.rename(columns={'Score': 'Total'}), # Fix column name for output
left_on=[*cols, df1.groupby(cols).cumcount()],
right_on=[*cols, df2.groupby(cols).cumcount()],
how='left')
.drop('key_3', axis=1) # remove added merge key
)
df3:
Class Section ID Subject Score Total
0 I A 12 Maths 70 230.0
1 I A 12 Chemistry 85 NaN
2 I A 12 Physics 75 NaN
3 I A 16 Maths 70 230.0
4 I A 16 Chemistry 85 65.0
5 I A 16 Physics 75 NaN
6 I A 16 Arts 65 NaN
7 I B 14 Arts 60 65.0 # This should be 65 from df2
Then fillna and astype to fix the Total column:
df3['Total'] = df3['Total'].fillna(0).astype(int)
df3:
Class Section ID Subject Score Total
0 I A 12 Maths 70 230
1 I A 12 Chemistry 85 0
2 I A 12 Physics 75 0
3 I A 16 Maths 70 230
4 I A 16 Chemistry 85 65
5 I A 16 Physics 75 0
6 I A 16 Arts 65 0
7 I B 14 Arts 60 65
DataFrame constructors:
import pandas as pd
df1 = pd.DataFrame({
'Class': ['I', 'I', 'I', 'I', 'I', 'I', 'I', 'I'],
'Section': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B'],
'ID': [12, 12, 12, 16, 16, 16, 16, 14],
'Subject': ['Maths', 'Chemistry', 'Physics', 'Maths', 'Chemistry',
'Physics', 'Arts', 'Arts'],
'Score': [70, 85, 75, 70, 85, 75, 65, 60]
})
df2 = pd.DataFrame({
'Class': ['I', 'I', 'I', 'I'],
'Section': ['A', 'A', 'A', 'B'],
'ID': [12, 16, 16, 14],
'Subject': ['Total', 'Total', 'Total', 'Total'],
'Score': [230, 230, 65, 65]
})
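As a variation on the answer above, the per-group counter can be stored in an explicit helper column before merging, which avoids having to drop the automatically named key_3 column afterwards. This is a minimal sketch: the column name seq and the variable names left, right and merged are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Class": ["I"] * 8,
    "Section": ["A", "A", "A", "A", "A", "A", "A", "B"],
    "ID": [12, 12, 12, 16, 16, 16, 16, 14],
    "Subject": ["Maths", "Chemistry", "Physics", "Maths", "Chemistry",
                "Physics", "Arts", "Arts"],
    "Score": [70, 85, 75, 70, 85, 75, 65, 60],
})
df2 = pd.DataFrame({
    "Class": ["I", "I", "I", "I"],
    "Section": ["A", "A", "A", "B"],
    "ID": [12, 16, 16, 14],
    "Subject": ["Total", "Total", "Total", "Total"],
    "Score": [230, 230, 65, 65],
})

cols = ["Class", "Section", "ID"]

# Number the rows inside each (Class, Section, ID) group: 0, 1, 2, ...
left = df1.assign(seq=df1.groupby(cols).cumcount())
right = (
    df2.drop(columns="Subject")
       .rename(columns={"Score": "Total"})
       .assign(seq=df2.groupby(cols).cumcount())
)

# The n-th row of a group in df1 only matches the n-th row of that group in df2.
merged = left.merge(right, on=[*cols, "seq"], how="left").drop(columns="seq")
merged["Total"] = merged["Total"].fillna(0).astype(int)
print(merged)
```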

Pandas - Create column with difference in values

I have the dataset below. How can I create a new column that shows the difference in money for each person, for each expiry?
The column in yellow is what I want. It is the difference in money at each expiry point for the person. I highlighted the other rows in colors to make it clearer.
Thanks a lot.

import pandas as pd
import numpy as np
example = pd.DataFrame( data = {'Day': ['2020-08-30', '2020-08-30','2020-08-30','2020-08-30',
'2020-08-29', '2020-08-29','2020-08-29','2020-08-29'],
'Name': ['John', 'Mike', 'John', 'Mike','John', 'Mike', 'John', 'Mike'],
'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
'Expiry': ['1Y', '1Y', '2Y','2Y','1Y','1Y','2Y','2Y']})
example_0830 = example[ example['Day']=='2020-08-30' ].reset_index()
example_0829 = example[ example['Day']=='2020-08-29' ].reset_index()
example_0830['key'] = example_0830['Name'] + example_0830['Expiry']
example_0829['key'] = example_0829['Name'] + example_0829['Expiry']
example_0829 = pd.DataFrame( example_0829, columns = ['key','Money'])
example_0830 = pd.merge(example_0830, example_0829, on = 'key')
example_0830['Difference'] = example_0830['Money_x'] - example_0830['Money_y']
example_0830 = example_0830.drop(columns=['key', 'Money_y','index'])
Result:
Day Name Money_x Expiry Difference
0 2020-08-30 John 100 1Y 50
1 2020-08-30 Mike 950 1Y 900
2 2020-08-30 John 200 2Y -50
3 2020-08-30 Mike 1000 2Y -200
If the difference is always taken against the previous date, you can define date variables at the beginning for today (t) and the previous day (t-1), and use them to filter the original dataframe.

You can solve it with groupby.diff
Take the dataframe
df = pd.DataFrame({
'Day': [30, 30, 30, 30, 29, 29, 28, 28],
'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
'Expiry': [1, 1, 2, 2, 1, 1, 2, 2]
})
print(df)
Which looks like
Day Name Money Expiry
0 30 John 100 1
1 30 Mike 950 1
2 30 John 200 2
3 30 Mike 1000 2
4 29 John 50 1
5 29 Mike 50 1
6 28 John 250 2
7 28 Mike 1200 2
And the code
# make sure the dates are in the order we want (sort_values returns a new dataframe, so assign it back)
df = df.sort_values('Day', ascending=False)
# group by Name and Expiry and take the difference from the next row in each group
# diff(1) subtracts the previous row, so diff(-1) subtracts the next row
df['Difference'] = df.groupby(['Name', 'Expiry']).Money.diff(-1)
Output
Day Name Money Expiry Difference
0 30 John 100 1 50.0
1 30 Mike 950 1 900.0
2 30 John 200 2 -50.0
3 30 Mike 1000 2 -200.0
4 29 John 50 1 NaN
5 29 Mike 50 1 NaN
6 28 John 250 2 NaN
7 28 Mike 1200 2 NaN
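The same approach can be applied to the data from the question, with real dates and the 1Y/2Y expiry labels. This is a minimal sketch: the names ordered and latest are chosen here for illustration, and dropping the NaN rows at the end is an assumption about the desired output (only the most recent day keeps a difference).

```python
import pandas as pd

example = pd.DataFrame({
    "Day": pd.to_datetime(["2020-08-30"] * 4 + ["2020-08-29"] * 4),
    "Name": ["John", "Mike", "John", "Mike", "John", "Mike", "John", "Mike"],
    "Money": [100, 950, 200, 1000, 50, 50, 250, 1200],
    "Expiry": ["1Y", "1Y", "2Y", "2Y", "1Y", "1Y", "2Y", "2Y"],
})

# Newest day first, so "next row in the group" means "previous day".
ordered = example.sort_values("Day", ascending=False)

# Within each (Name, Expiry) pair: money on this day minus money on the earlier day.
ordered["Difference"] = ordered.groupby(["Name", "Expiry"])["Money"].diff(-1)

# Rows for the earliest day have nothing to compare against and hold NaN.
latest = ordered.dropna(subset=["Difference"])
print(latest)
```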
