Sorting a Specific Column in a Pandas DataFrame - python

Below is my data frame:
import pandas as pd

data = pd.DataFrame(
    [['A', 1, 15, 100, 123], ['A', 2, 16, 50, 7], ['A', 3, 17, 100, 5],
     ['B', 1, 20, 75, 123], ['B', 2, 25, 125, 7], ['B', 3, 23, 100, 7],
     ['C', 1, 5, 85, 12], ['C', 2, 1, 25, 6], ['C', 3, 7, 100, 7]],
    columns=['Group', 'Ranking', 'Data1', 'Data2', 'Correspondence'])
Group Ranking Data1 Data2 Correspondence
0 A 1 15 100 123
1 A 2 16 50 7
2 A 3 17 100 5
3 B 1 20 75 123
4 B 2 25 125 7
5 B 3 23 100 7
6 C 1 5 85 12
7 C 2 1 25 6
8 C 3 7 100 7
I have already sorted the data frame by 'Group'. However, I still need to sort the data within each group: Data1 must be sorted from lowest to highest, and the values in Data2 must follow Data1 to their new positions. The Correspondence column must not move (it stays as in the original data frame), and the Ranking column stays as it is as well. I have tried df.sort_values(), but I am unable to get the result below:
Group Ranking Data1 Data2 Correspondence
0 A 1 15 100 123
1 A 2 16 50 7
2 A 3 17 100 5
3 B 1 20 75 123
4 B 2 23 100 7
5 B 3 25 125 7
6 C 1 1 25 12
7 C 2 5 85 6
8 C 3 7 100 7
So basically my aim is: sort the values in Data1 from lowest to highest within each Group, have the values in Data2 follow the movement of Data1 after sorting, and leave the Correspondence column where it originally stands.
Thanks.

Use DataFrame.sort_values with both columns and assign the result back as a numpy array with .values, so the assignment is positional rather than aligned on the index:
cols = ['Data1','Data2']
data[cols] = data.sort_values(['Group','Data1'])[cols].values
#pandas 0.24+
#data[cols] = data.sort_values(['Group','Data1'])[cols].to_numpy()
print (data)
Group Ranking Data1 Data2 Correspondence
0 A 1 15 100 123
1 A 2 16 50 7
2 A 3 17 100 5
3 B 1 20 75 123
4 B 2 23 100 7
5 B 3 25 125 7
6 C 1 1 25 12
7 C 2 5 85 6
8 C 3 7 100 7
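As a minimal sketch of why the positional assignment matters (illustrative data, not from the question): without .values/.to_numpy(), pandas aligns the sorted rows back to their original index labels, so the frame appears unchanged.
import pandas as pd

demo = pd.DataFrame({'Group': ['B', 'B', 'A'],
                     'Data1': [25, 23, 15],
                     'Data2': [125, 100, 100]})
cols = ['Data1', 'Data2']

aligned = demo.copy()
aligned[cols] = demo.sort_values(['Group', 'Data1'])[cols]  # index alignment: no visible change

positional = demo.copy()
positional[cols] = demo.sort_values(['Group', 'Data1'])[cols].to_numpy()  # sorted order sticks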

Did you try something like this?
data2 = data.sort_values(by=['Group', 'Data1']).reset_index(drop=True)
data2['Correspondence'] = data['Correspondence']

Have you tried the sort_values function? Based on the documentation you can do it like this:
data.sort_values(['Group', 'Data1'], ascending=[True, True])

You can try something like:
data.sort_values(['Group', 'Data1', 'Data2'], ascending=[True, True, False])
If you want a column to be sorted in descending order, set its entry to False.

Use a left join as follows:
df2 = data.sort_values(['Group', 'Data1']).reset_index(drop=True)
df3 = df2[['Group', 'Ranking', 'Data1', 'Data2']].join(data[['Correspondence']])
df3
which gives the following result:
Group Ranking Data1 Data2 Correspondence
0 A 1 15 100 123
1 A 2 16 50 7
2 A 3 17 100 5
3 B 1 20 75 123
4 B 3 23 100 7
5 B 2 25 125 7
6 C 2 1 25 12
7 C 1 5 85 6
8 C 3 7 100 7

Related

python pandas: Remove duplicates by column A which do not satisfy a condition in column B

I have a dataframe with repeated values in column A. I want to drop duplicates, keeping the rows which have a value > 0 in column B.
So this:
A B
1 20
1 10
1 -3
2 30
2 -9
2 40
3 10
Should turn into this:
A B
1 20
1 10
2 30
2 40
3 10
Any suggestions on how this can be achieved? I shall be grateful!
There are no duplicates in the sample data, so you only need to filter:
df = df[df['B'].gt(0)]
print (df)
A B
0 1 20
1 1 10
3 2 30
5 2 40
6 3 10
If there are duplicates:
print (df)
A B
0 1 20
1 1 10
2 1 10
3 1 10
4 1 -3
5 2 30
6 2 -9
7 2 40
8 3 10
df = df[df['B'].gt(0) & ~df.duplicated()]
print (df)
A B
0 1 20
1 1 10
5 2 30
7 2 40
8 3 10
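If "duplicates" here means fully identical (A, B) rows, an equivalent sketch (my wording, not from the thread) chains the filter with drop_duplicates:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3],
                   'B': [20, 10, 10, -3, 30, -9, 40, 10]})
df = df[df['B'].gt(0)].drop_duplicates()  # keep B > 0, then drop identical rows
print(df)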

Remove duplicates after a certain number of occurrences

How do we filter the dataframe below to remove all duplicate ID rows after a certain number of occurrences of that ID, i.e. remove all rows with ID == 0 after the 3rd occurrence of ID == 0?
Thanks
pd.DataFrame(np.random.randint(0,10,size=(100, 2)), columns=['ID', 'Value']).sort_values('ID')
Output:
ID Value
0 7
0 8
0 5
0 5
... ... ...
9 7
9 7
9 1
9 3
Desired Output for filter_count = 3:
ID Value
0 7
0 8
0 5
1 7
1 7
1 1
2 3
If you want to do this for all IDs, use:
df.groupby("ID").head(3)
For a single ID, you can assign a new column using cumcount and then filter on the conditions:
df["count"] = df.groupby("ID")["Value"].cumcount()
print(df.loc[df["ID"].ne(0) | (df["ID"].eq(0) & (df["count"] < 3))])
ID Value count
64 0 6 0
77 0 6 1
83 0 0 2
44 1 7 0
58 1 5 1
40 1 2 2
35 1 7 3
89 1 9 4
19 1 7 5
10 1 3 6
45 2 4 0
68 2 1 1
74 2 4 2
75 2 8 3
34 2 4 4
60 2 6 5
78 2 0 6
31 2 8 7
97 2 9 8
2 2 6 9
93 2 8 10
13 2 2 11
...
I would do it without groupby:
df = pd.concat([df.loc[df.ID==0].head(3),df.loc[df.ID!=0]])
Thanks Henry,
I modified your code and I think this should work as well.
Your df.groupby("ID").head(3) is great. Thanks.
df["count"] = df.groupby("ID")["Value"].cumcount()
df.loc[df["count"]<3].drop(['count'], axis=1)
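Putting the pieces together, a short sketch (filter_count is the question's hypothetical parameter name) that caps either every ID or just a single ID at filter_count occurrences:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(100, 2)),
                  columns=['ID', 'Value']).sort_values('ID')
filter_count = 3

# cap every ID at filter_count rows
capped_all = df.groupby('ID').head(filter_count)

# cap only ID == 0, leaving the other IDs untouched
mask = df['ID'].ne(0) | (df.groupby('ID').cumcount() < filter_count)
capped_zero = df[mask]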

Several Layers of If Statements with String

I have a data frame
df = pd.DataFrame([[3,2,1,5,'Stay',2],[4,5,6,10,'Leave',10],
[10,20,30,40,'Stay',11],[12,2,3,3,'Leave',15],
[31,23,31,45,'Stay',25],[12,21,17,6,'Stay',15],
[15,17,18,12,'Leave',10],[3,2,1,5,'Stay',3],
[12,2,3,3,'Leave',12]], columns = ['A','B','C','D','Status','E'])
A B C D Status E
0 3 2 1 5 Stay 2
1 4 5 6 10 Leave 10
2 10 20 30 40 Stay 11
3 12 2 3 3 Leave 15
4 31 23 31 45 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
I want to apply a condition row by row: if Status is Stay and column E is smaller than column A, shift the data one column to the right, so column D takes the value from column C, column C from column B, column B from column A, and column A from column E.
If Status is Leave and column E is larger than column A, apply the same shift.
So the result is:
A B C D Status E
0 2 3 2 1 Stay 2
1 10 4 5 6 Leave 10
2 10 20 30 40 Stay 11
3 15 12 2 3 Leave 15
4 25 31 23 31 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
My attempt:
if df['Status'] == 'Stay':
    if df['E'] < df['A']:
        df['D'] = df['C']
        df['C'] = df['B']
        df['B'] = df['A']
        df['A'] = df['E']
elif df['Status'] == 'Leave':
    if df['E'] > df['A']:
        df['D'] = df['C']
        df['C'] = df['B']
        df['B'] = df['A']
        df['A'] = df['E']
This runs into several problems, including an error from comparing the string column inside a plain if statement. Your help is kindly appreciated.
I think you want boolean indexing:
s1 = df.Status.eq('Stay') & df['E'].lt(df['A'])
s2 = df.Status.eq('Leave') & df['E'].gt(df['A'])
s = s1 | s2
df.loc[s, ['A','B','C','D']] = df.loc[s, ['E','A','B','C']].to_numpy()
Output:
A B C D Status E
0 2 3 2 1 Stay 2
1 10 4 5 6 Leave 10
2 10 20 30 40 Stay 11
3 15 12 2 3 Leave 15
4 25 31 23 31 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
Using np.roll with .loc:
shift = np.roll(df.select_dtypes(exclude='object'),1,axis=1)[:, :-1]
m1 = df['Status'].eq('Stay') & (df['E'] < df['A'])
m2 = df['Status'].eq('Leave') & (df['E'] > df['A'])
df.loc[m1|m2, ['A','B','C','D']] = shift[m1|m2]
A B C D Status E
0 2 3 2 1 Stay 2
1 10 4 5 6 Leave 10
2 10 20 30 40 Stay 11
3 15 12 2 3 Leave 15
4 25 31 23 31 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
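To make the np.roll step concrete, a small sketch (single illustrative row) of what the shifted array looks like before the mask is applied:
import numpy as np
import pandas as pd

df = pd.DataFrame([[3, 2, 1, 5, 'Stay', 2]],
                  columns=['A', 'B', 'C', 'D', 'Status', 'E'])

# rolling the numeric columns right by one turns (A, B, C, D, E) into (E, A, B, C, D)
shift = np.roll(df.select_dtypes(exclude='object'), 1, axis=1)
print(shift)           # [[2 3 2 1 5]]
print(shift[:, :-1])   # [[2 3 2 1]] -> the values destined for columns A..D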
Use DataFrame.mask with DataFrame.shift:
# move Status into the index so shift only touches the numeric columns
new_df = df.set_index('Status')
# DataFrame holding the replacement (right-shifted) values
df_modify = new_df.shift(axis=1, fill_value=df['E'])
# boolean masks for the two conditions
under_mask = (df.Status.eq('Stay')) & (df.E < df.A)
over_mask = (df.Status.eq('Leave')) & (df.E > df.A)
# apply DataFrame.mask
new_df = new_df.mask(under_mask | over_mask, df_modify).reset_index()
print(new_df)
Output
Status A B C D E
0 Stay 2 3 2 1 5
1 Leave 10 4 5 6 10
2 Stay 10 20 30 40 11
3 Leave 15 12 2 3 3
4 Stay 25 31 23 31 45
5 Stay 12 21 17 6 15
6 Leave 15 17 18 12 10
7 Stay 3 2 1 5 3
8 Leave 12 2 3 3 12
It sounds like you want to do this for each row of the data, but your code as written operates on whole columns at once. You could iterate over the rows with DataFrame.iterrows:
for idx, row in df.iterrows():
    if row['Status'] == 'Stay':
        ... etc ...
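A complete row-wise sketch along those lines (assuming the df defined in the question; slower than the vectorized answers above and shown only for illustration). Writing back through df.loc makes the updates stick:
for idx, row in df.iterrows():
    stay = row['Status'] == 'Stay' and row['E'] < row['A']
    leave = row['Status'] == 'Leave' and row['E'] > row['A']
    if stay or leave:
        # shift right: A <- E, B <- A, C <- B, D <- C
        df.loc[idx, ['A', 'B', 'C', 'D']] = [row['E'], row['A'], row['B'], row['C']]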

How to set a row limit for the pandas DataFrame max function

I have 100 rows in column B, but I want to find the maximum value over only the first 99 rows.
If I use the code below, it returns the maximum over all 100 rows instead of 99:
print(df1['noc'].max(axis=0))
Use head or iloc to select the first 99 values and then take the max:
print(df1['noc'].head(99).max())
Or, as IanS commented:
print (df1['noc'].iloc[:99].max())
Sample:
np.random.seed(15)
df1 = pd.DataFrame({'noc':np.random.randint(10, size=15)})
print (df1)
noc
0 8
1 5
2 5
3 7
4 0
5 7
6 5
7 6
8 1
9 7
10 0
11 4
12 9
13 7
14 5
print(df1['noc'].head(5).max())
8
print (df1['noc'].iloc[:5].max())
8
print (df1['noc'].values[:5].max())
8
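If the intent is really "all rows except the last one" rather than a hard-coded 99, a negative slice avoids fixing the length (a sketch, not from the thread):
print(df1['noc'].iloc[:-1].max())  # max over everything but the final row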

Multiply dataframes with differently ordered column names

I have two dataframes, the first one df1 contains only one row :
A B C D E
0 5 8 9 5 0
and the second one has multiple rows , but the same number of columns:
D C E A B
0 5 0 3 3 7
1 9 3 5 2 4
2 7 6 8 8 1
3 6 7 7 8 1
4 5 9 8 9 4
5 3 0 3 5 0
6 2 3 8 1 3
7 3 3 7 0 1
8 9 9 0 4 7
9 3 2 7 2 0
In the real example I have many more columns (more than 100). Both dataframes have the same number of columns and the same column names, but the order of the columns is different, as shown in the example.
I need to multiply the two dataframes element-wise (matrix-like multiplication), but I cannot simply do df2.values * df1.values because the columns are not ordered the same way: for instance, column B of df1 cannot be multiplied by the second column of df2, since the second column of df2 is C, while B is the 5th column of df2.
Is there a simple, pythonic way to multiply the dataframes taking the column names into account rather than the column positions?
df1[df2.columns] returns a dataframe where the columns are ordered as in df2:
df1
Out[91]:
A B C D E
0 5 8 9 5 0
df1[df2.columns]
Out[92]:
D C E A B
0 5 9 0 5 8
So, you just need:
df2.values * df1[df2.columns].values
This will raise a KeyError if df2 has columns that df1 lacks; and it will only select df2's columns even if df1 has additional columns.
As #MaxU noted, since you are operating on numpy arrays, to get back a dataframe structure you will need:
pd.DataFrame(df2.values * df1[df2.columns].values, columns = df2.columns)
You can use mul; df1 is converted to a Series by selecting its first row with iloc (ix is deprecated):
print(df1.iloc[0])
A 5
B 8
C 9
D 5
E 0
Name: 0, dtype: int64
print(df2.mul(df1.iloc[0]))
A B C D E
0 15 56 0 25 0
1 10 32 27 45 0
2 40 8 54 35 0
3 40 8 63 30 0
4 45 32 81 25 0
5 25 0 0 15 0
6 5 24 27 10 0
7 0 8 27 15 0
8 20 56 81 45 0
9 10 0 18 15 0
If you need to change the column order of the final DataFrame, use reindex:
print(df2.mul(df1.iloc[0]).reindex(df2.columns.tolist(), axis=1))
D C E A B
0 25 0 0 15 56
1 45 27 0 10 32
2 35 54 0 40 8
3 30 63 0 40 8
4 25 81 0 45 32
5 15 0 0 25 0
6 10 27 0 5 24
7 15 27 0 0 8
8 45 81 0 20 56
9 15 18 0 10 0
Another solution is to reorder the columns first by reindexing the Series to df2.columns:
print(df2.mul(df1.iloc[0].reindex(df2.columns)))
D C E A B
0 25 0 0 15 56
1 45 27 0 10 32
2 35 54 0 40 8
3 30 63 0 40 8
4 25 81 0 45 32
5 15 0 0 25 0
6 10 27 0 5 24
7 15 27 0 0 8
8 45 81 0 20 56
9 15 18 0 10 0
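A side observation (not from the original answers): DataFrame.mul with a Series aligns on labels anyway, so the explicit reindex mainly controls the column order of the result:
import pandas as pd

df1 = pd.DataFrame([[5, 8, 9, 5, 0]], columns=list('ABCDE'))
df2 = pd.DataFrame([[5, 0, 3, 3, 7]], columns=list('DCEAB'))

print(df2.mul(df1.iloc[0]))                       # aligned by label; columns come back as A..E
print(df2.mul(df1.iloc[0].reindex(df2.columns)))  # keeps df2's D C E A B order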
