I have a dataframe with repeated values in column A. I want to drop duplicates, keeping the rows whose value in column B is > 0.
So this:
A B
1 20
1 10
1 -3
2 30
2 -9
2 40
3 10
Should turn into this:
A B
1 20
1 10
2 30
2 40
3 10
Any suggestions on how this can be achieved? I shall be grateful!
There are no duplicates in the sample data, so use only:
df = df[df['B'].gt(0)]
print (df)
A B
0 1 20
1 1 10
3 2 30
5 2 40
6 3 10
If there are duplicates:
print (df)
A B
0 1 20
1 1 10
2 1 10
3 1 10
4 1 -3
5 2 30
6 2 -9
7 2 40
8 3 10
df = df[df['B'].gt(0) & ~df.duplicated()]
print (df)
A B
0 1 20
1 1 10
5 2 30
7 2 40
8 3 10
How do we filter the dataframe below to remove all duplicate ID rows after a certain number of occurrences of that ID, i.e. remove all rows with ID == 0 after the 3rd occurrence of ID == 0?
Thanks
df = pd.DataFrame(np.random.randint(0, 10, size=(100, 2)), columns=['ID', 'Value']).sort_values('ID')
Output:
ID Value
0 7
0 8
0 5
0 5
... ... ...
9 7
9 7
9 1
9 3
Desired output for filter_count = 3:
ID Value
0 7
0 8
0 5
1 7
1 7
1 1
2 3
If you want to do this for all IDs, use:
df.groupby("ID").head(3)
For a single ID, you can assign a new column using cumcount and then filter on the conditions:
df["count"] = df.groupby("ID")["Value"].cumcount()
print (df.loc[df["ID"].ne(0) | (df["ID"].eq(0) & (df["count"] < 3))])
ID Value count
64 0 6 0
77 0 6 1
83 0 0 2
44 1 7 0
58 1 5 1
40 1 2 2
35 1 7 3
89 1 9 4
19 1 7 5
10 1 3 6
45 2 4 0
68 2 1 1
74 2 4 2
75 2 8 3
34 2 4 4
60 2 6 5
78 2 0 6
31 2 8 7
97 2 9 8
2 2 6 9
93 2 8 10
13 2 2 11
...
I would do it without groupby:
df = pd.concat([df.loc[df.ID==0].head(3),df.loc[df.ID!=0]])
Thanks Henry,
I modified your code and I think this should work as well.
Your df.groupby("ID").head(3) is great. Thanks.
df["count"] = df.groupby("ID")["Value"].cumcount()
df.loc[df["count"]<3].drop(['count'], axis=1)
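A slightly more compact variant (my own addition, not from the thread) filters on cumcount directly, so no helper column is needed:
# keep only the first 3 rows of each ID
df[df.groupby("ID").cumcount() < 3]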
I have a data frame
df = pd.DataFrame([[3,2,1,5,'Stay',2],[4,5,6,10,'Leave',10],
[10,20,30,40,'Stay',11],[12,2,3,3,'Leave',15],
[31,23,31,45,'Stay',25],[12,21,17,6,'Stay',15],
[15,17,18,12,'Leave',10],[3,2,1,5,'Stay',3],
[12,2,3,3,'Leave',12]], columns = ['A','B','C','D','Status','E'])
A B C D Status E
0 3 2 1 5 Stay 2
1 4 5 6 10 Leave 10
2 10 20 30 40 Stay 11
3 12 2 3 3 Leave 15
4 31 23 31 45 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
I want to run a condition on each row: if Status is Stay and column E is smaller than column A, then shift the data so that column D takes the value of column C, column C takes the value of column B, column B takes the value of column A, and column A takes the value of column E.
If Status is Leave and column E is larger than column A, apply the same shift: D takes C, C takes B, B takes A, and A takes E.
So the result is:
A B C D Status E
0 2 3 2 1 Stay 2
1 10 4 5 6 Leave 10
2 10 20 30 40 Stay 11
3 15 12 2 3 Leave 15
4 25 31 23 31 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
My attempt:
if df['Status'] == 'Stay':
    if df['E'] < df['A']:
        df['D'] = df['C']
        df['C'] = df['B']
        df['B'] = df['A']
        df['A'] = df['E']
elif df['Status'] == 'Leave':
    if df['E'] > df['A']:
        df['D'] = df['C']
        df['C'] = df['B']
        df['B'] = df['A']
        df['A'] = df['E']
This runs into several problems, including a problem with the string column. Your help is kindly appreciated.
I think you want boolean indexing:
s1 = df.Status.eq('Stay') & df['E'].lt(df['A'])
s2 = df.Status.eq('Leave') & df['E'].gt(df['A'])
s = s1 | s2
df.loc[s, ['A','B','C','D']] = df.loc[s, ['E','A','B','C']].to_numpy()
Output:
A B C D Status E
0 2 3 2 1 Stay 2
1 10 4 5 6 Leave 10
2 10 20 30 40 Stay 11
3 15 12 2 3 Leave 15
4 25 31 23 31 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
Using np.roll with .loc:
shift = np.roll(df.select_dtypes(exclude='object'),1,axis=1)[:, :-1]
m1 = df['Status'].eq('Stay') & (df['E'] < df['A'])
m2 = df['Status'].eq('Leave') & (df['E'] > df['A'])
df.loc[m1|m2, ['A','B','C','D']] = shift[m1|m2]
A B C D Status E
0 2 3 2 1 Stay 2
1 10 4 5 6 Leave 10
2 10 20 30 40 Stay 11
3 15 12 2 3 Leave 15
4 25 31 23 31 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
Use DataFrame.mask + DataFrame.shift:
# use Status as the index so shift only moves the numeric columns
new_df = df.set_index('Status')
# DataFrame holding the shifted replacement values
df_modify = new_df.shift(axis=1, fill_value=df['E'])
# build the boolean masks
under_mask = (df.Status.eq('Stay')) & (df.E < df.A)
over_mask = (df.Status.eq('Leave')) & (df.E > df.A)
# use DataFrame.mask to swap in the shifted values where a mask is True
new_df = new_df.mask(under_mask | over_mask, df_modify).reset_index()
print(new_df)
Output
Status A B C D E
0 Stay 2 3 2 1 5
1 Leave 10 4 5 6 10
2 Stay 10 20 30 40 11
3 Leave 15 12 2 3 3
4 Stay 25 31 23 31 45
5 Stay 12 21 17 6 15
6 Leave 15 17 18 12 10
7 Stay 3 2 1 5 3
8 Leave 12 2 3 3 12
It sounds like you want to do this for each row of the data, but your code is written to try to do it at the top level. Can you use a for ... in loop to iterate over the rows?
for idx, row in df.iterrows():
    if row['Status'] == 'Stay':
        ... etc ...
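Filled out, the row-wise version might look roughly like this (just a sketch; the vectorized answers above will be much faster on large frames):
for idx, row in df.iterrows():
    stay_swap = row['Status'] == 'Stay' and row['E'] < row['A']
    leave_swap = row['Status'] == 'Leave' and row['E'] > row['A']
    if stay_swap or leave_swap:
        # shift within the row: A <- E, B <- A, C <- B, D <- C
        df.loc[idx, ['A', 'B', 'C', 'D']] = [row['E'], row['A'], row['B'], row['C']]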
I have 100 rows in column B, but I want to find the maximum value over only 99 of them.
If I use the code below, it returns the maximum from all 100 rows instead of 99:
print(df1['noc'].max(axis=0))
Use head or iloc to select the first 99 values and then get the max:
print(df1['noc'].head(99).max())
Or, as IanS commented:
print (df1['noc'].iloc[:99].max())
Sample:
np.random.seed(15)
df1 = pd.DataFrame({'noc':np.random.randint(10, size=15)})
print (df1)
noc
0 8
1 5
2 5
3 7
4 0
5 7
6 5
7 6
8 1
9 7
10 0
11 4
12 9
13 7
14 5
print(df1['noc'].head(5).max())
8
print (df1['noc'].iloc[:5].max())
8
print (df1['noc'].values[:5].max())
8
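If the intent is specifically to exclude the last of the 100 rows rather than take the first 99, slicing from the end should work as well (my interpretation, not stated in the question):
print (df1['noc'].iloc[:-1].max())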
I have two dataframes; the first one, df1, contains only one row:
A B C D E
0 5 8 9 5 0
and the second one, df2, has multiple rows but the same number of columns:
D C E A B
0 5 0 3 3 7
1 9 3 5 2 4
2 7 6 8 8 1
3 6 7 7 8 1
4 5 9 8 9 4
5 3 0 3 5 0
6 2 3 8 1 3
7 3 3 7 0 1
8 9 9 0 4 7
9 3 2 7 2 0
In the real example I have many more columns (more than 100). Both dataframes have the same number of columns and the same column names, but the order of the columns is different, as shown in the example.
I need to multiply the two dataframes element-wise (matrix-like multiplication), but I can't simply do df2.values * df1.values because the columns are not ordered in the same way: for instance, the second column of df1, B, can't be multiplied by the second column of df2, because the second column of df2 is C, while B is the 5th column of df2.
Is there a simple and pythonic solution to multiply the dataframes, taking the column names into account rather than the column index?
df1[df2.columns] returns a dataframe where the columns are ordered as in df2:
df1
Out[91]:
A B C D E
0 3 8 9 5 0
df1[df2.columns]
Out[92]:
D C E A B
0 5 9 0 3 8
So, you just need:
df2.values * df1[df2.columns].values
This will raise a KeyError if df2 has columns that are missing from df1, and it will only select df2's columns even if df1 has additional columns.
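If df2 may contain columns that df1 lacks, a softer alternative (an extra safeguard, not part of the original answer) is to reindex df1 instead of selecting, so missing columns become NaN rather than raising a KeyError:
# align df1's columns to df2's order; any column absent from df1 becomes NaN
df2.values * df1.reindex(columns=df2.columns).values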
As @MaxU noted, since you are operating on numpy arrays, to get back to a dataframe structure you will need:
pd.DataFrame(df2.values * df1[df2.columns].values, columns = df2.columns)
You can use mul; df1 is first converted to a Series with iloc:
print(df1.iloc[0])
A 5
B 8
C 9
D 5
E 0
Name: 0, dtype: int64
print(df2.mul(df1.iloc[0]))
A B C D E
0 15 56 0 25 0
1 10 32 27 45 0
2 40 8 54 35 0
3 40 8 63 30 0
4 45 32 81 25 0
5 25 0 0 15 0
6 5 24 27 10 0
7 0 8 27 15 0
8 20 56 81 45 0
9 10 0 18 15 0
If you need to change the column order of the final DataFrame, use reindex:
print(df2.mul(df1.iloc[0]).reindex(columns=df2.columns))
D C E A B
0 25 0 0 15 56
1 45 27 0 10 32
2 35 54 0 40 8
3 30 63 0 40 8
4 25 81 0 45 32
5 15 0 0 25 0
6 10 27 0 5 24
7 15 27 0 0 8
8 45 81 0 20 56
9 15 18 0 10 0
Another solution is to reorder the columns by reindexing the Series with df2.columns:
print(df2.mul(df1.iloc[0].reindex(df2.columns)))
D C E A B
0 25 0 0 15 56
1 45 27 0 10 32
2 35 54 0 40 8
3 30 63 0 40 8
4 25 81 0 45 32
5 15 0 0 25 0
6 10 27 0 5 24
7 15 27 0 0 8
8 45 81 0 20 56
9 15 18 0 10 0